MyArxiv
Computation and Language
☆ Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration
Large Language Models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge-intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision), often requires costly development of large-scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular, multi-agent framework for extending LLMs to multimodal reasoning by orchestrating collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners through conversations. We then introduce a data synthesis and supervised fine-tuning pipeline to train the perceiver agent to effectively collaborate with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks the multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open-source solution, i.e. equipping text-only DeepSeek-R1 with Qwen2.5-VL-7B perceiver, to outperform large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi-agent approach for building future multimodal reasoning systems.
☆ DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.
☆ Scalable Parameter-Light Spectral Method for Clustering Short Text Embeddings with a Cohesion-Based Evaluation Metric
Clustering short text embeddings is a foundational task in natural language processing, yet remains challenging due to the need to specify the number of clusters in advance. We introduce a scalable spectral method that estimates the number of clusters directly from the structure of the Laplacian eigenspectrum, constructed using cosine similarities and guided by an adaptive sampling strategy. This sampling approach enables our estimator to efficiently scale to large datasets without sacrificing reliability. To support intrinsic evaluation of cluster quality without ground-truth labels, we propose the Cohesion Ratio, a simple and interpretable evaluation metric that quantifies how much intra-cluster similarity exceeds the global similarity background. It has an information-theoretic motivation inspired by mutual information, and in our experiments it correlates closely with extrinsic measures such as normalized mutual information and homogeneity. Extensive experiments on six short-text datasets and four modern embedding models show that standard algorithms like K-Means and HAC, when guided by our estimator, significantly outperform popular parameter-light methods such as HDBSCAN, OPTICS, and Leiden. These results demonstrate the practical value of our spectral estimator and Cohesion Ratio for unsupervised organization and evaluation of short text data. Implementation of our estimator of k and Cohesion Ratio, along with code for reproducing the experiments, is available at https://anonymous.4open.science/r/towards_clustering-0C2E.
☆ Learning to Reason: Training LLMs with GPT-OSS or DeepSeek R1 Reasoning Traces
Test-time scaling, which leverages additional computation during inference to improve model accuracy, has enabled a new class of Large Language Models (LLMs) that are able to reason through complex problems by understanding the goal, turning this goal into a plan, working through intermediate steps, and checking their own work before answering . Frontier large language models with reasoning capabilities, such as DeepSeek-R1 and OpenAI's gpt-oss, follow the same procedure when solving complex problems by generating intermediate reasoning traces before giving the final answer. Today, these models are being increasingly used to generate reasoning traces that serve as high-quality supervised data for post-training of small and medium-sized language models to teach reasoning capabilities without requiring expensive human curation. In this work, we compare the performance of medium-sized LLMs on Math problems after post-training on two kinds of reasoning traces. We compare the impact of reasoning traces generated by DeepSeek-R1 and gpt-oss LLMs in terms of accuracy and inference efficiency.
☆ Generative Query Expansion with Multilingual LLMs for Cross-Lingual Information Retrieval
Query expansion is the reformulation of a user query by adding semantically related information, and is an essential component of monolingual and cross-lingual information retrieval used to ensure that relevant documents are not missed. Recently, multilingual large language models (mLLMs) have shifted query expansion from semantic augmentation with synonyms and related words to pseudo-document generation. Pseudo-documents both introduce additional relevant terms and bridge the gap between short queries and long documents, which is particularly beneficial in dense retrieval. This study evaluates recent mLLMs and fine-tuned variants across several generative expansion strategies to identify factors that drive cross-lingual retrieval performance. Results show that query length largely determines which prompting technique is effective, and that more elaborate prompts often do not yield further gains. Substantial linguistic disparities persist: cross-lingual query expansion can produce the largest improvements for languages with the weakest baselines, yet retrieval is especially poor between languages written in different scripts. Fine-tuning is found to lead to performance gains only when the training and test data are of similar format. These outcomes underline the need for more balanced multilingual and cross-lingual training and evaluation resources.
☆ What Drives Cross-lingual Ranking? Retrieval Approaches with Multilingual Language Models
Cross-lingual information retrieval (CLIR) enables access to multilingual knowledge but remains challenging due to disparities in resources, scripts, and weak cross-lingual semantic alignment in embedding models. Existing pipelines often rely on translation and monolingual retrieval heuristics, which add computational overhead and noise, degrading performance. This work systematically evaluates four intervention types, namely document translation, multilingual dense retrieval with pretrained encoders, contrastive learning at word, phrase, and query-document levels, and cross-encoder re-ranking, across three benchmark datasets. We find that dense retrieval models trained specifically for CLIR consistently outperform lexical matching methods and derive little benefit from document translation. Contrastive learning mitigates language biases and yields substantial improvements for encoders with weak initial alignment, and re-ranking can be effective, but depends on the quality of the cross-encoder training data. Although high-resource languages still dominate overall performance, gains over lexical and document-translated baselines are most pronounced for low-resource and cross-script pairs. These findings indicate that cross-lingual search systems should prioritise semantic multilingual embeddings and targeted learning-based alignment over translation-based pipelines, particularly for cross-script and under-resourced languages.
☆ MultiBanAbs: A Comprehensive Multi-Domain Bangla Abstractive Text Summarization Dataset
This study developed a new Bangla abstractive summarization dataset to generate concise summaries of Bangla articles from diverse sources. Most existing studies in this field have concentrated on news articles, where journalists usually follow a fixed writing style. While such approaches are effective in limited contexts, they often fail to adapt to the varied nature of real-world Bangla texts. In today's digital era, a massive amount of Bangla content is continuously produced across blogs, newspapers, and social media. This creates a pressing need for summarization systems that can reduce information overload and help readers understand content more quickly. To address this challenge, we developed a dataset of over 54,000 Bangla articles and summaries collected from multiple sources, including blogs such as Cinegolpo and newspapers such as Samakal and The Business Standard. Unlike single-domain resources, our dataset spans multiple domains and writing styles. It offers greater adaptability and practical relevance. To establish strong baselines, we trained and evaluated this dataset using several deep learning and transfer learning models, including LSTM, BanglaT5-small, and MTS-small. The results highlight its potential as a benchmark for future research in Bangla natural language processing. This dataset provides a solid foundation for building robust summarization systems and helps expand NLP resources for low-resource languages.
☆ PRInTS: Reward Modeling for Long-Horizon Information Seeking
Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. While process reward models (PRMs) can guide agents by ranking candidate steps at test-time, existing PRMs, designed for short reasoning with binary judgment, cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM's reasoning across multiple step quality dimensions (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. Extensive evaluations across FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks on multiple models, along with ablations, reveal that best-of-n sampling with PRInTS enhances information-seeking abilities of open-source models as well as specialized agents, matching or surpassing the performance of frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.
comment: 18 pages, code: https://github.com/G-JWLee/PRInTS
☆ AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning
Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decrease as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is avaiable at https://github.com/FoundationAgents/AutoEnv.
☆ MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings
A cognitive map is an internal model which encodes the abstract relationships among entities in the world, giving humans and animals the flexibility to adapt to new situations, with a strong out-of-distribution (OOD) generalization that current AI systems still do not possess. To bridge this gap, we introduce MapFormers, new architectures based on Transformer models, which can learn cognitive maps from observational data and perform path integration in parallel, in a self-supervised manner. Cognitive maps are learned in the model by disentangling structural relationships in the inputs from their specific content, a property that can be achieved naturally by updating the positional encoding in Transformers with input-dependent matrices. We developed two variants of MapFormers that unify absolute and relative positional encoding to model episodic (EM) and working memory (WM), respectively. We tested MapFormers on several tasks, including a classic 2D navigation task, showing that our models can learn a cognitive map of the underlying space and generalize OOD (e.g., to longer sequences) with near-perfect performance, unlike current architectures. Together, these results demonstrate the superiority of models designed to learn a cognitive map, and the importance of introducing a structural bias for structure-content disentanglement, which can be achieved in Transformers with input-dependent positional encoding. MapFormers have broad applications in both neuroscience and AI, by explaining the neural mechanisms giving rise to cognitive maps, while allowing these relation models to be learned at scale.
comment: 19 pages (29 with appendix), 8 figures
☆ CDLM: Consistency Diffusion Language Models For Faster Sampling
Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
comment: 18 pages, 6 figures
☆ A Nutrition Multimodal Photoplethysmography Language Model
Hunger and satiety dynamics shape dietary behaviors and metabolic health, yet remain difficult to capture in everyday settings. We present a Nutrition Photoplethysmography Language Model (NPLM), integrating continuous photoplethysmography (PPG) from wearables with meal descriptions. NPLM projects PPG into embeddings interpretable by language models, enabling joint reasoning over physiology and meal context. Trained on 19,340 participants and 1.1 million meal-PPG pairs, the model improved daily caloric intake prediction by 11% over text-only baselines, with accuracy maintained when 80% of meal text was removed. In an independent validation study (n=140) with controlled dining and detailed meal information, the model replicated these findings. These results demonstrate the value of integrating physiological measurements from consumer wearables with meal information for noninvasive dietary monitoring at scale.
comment: 21 pages, 2 figures
☆ In Machina N400: Pinpointing Where a Causal Language Model Detects Semantic Violations
How and where does a transformer notice that a sentence has gone semantically off the rails? To explore this question, we evaluated the causal language model (phi-2) using a carefully curated corpus, with sentences that concluded plausibly or implausibly. Our analysis focused on the hidden states sampled at each model layer. To investigate how violations are encoded, we utilized two complementary probes. First, we conducted a per-layer detection using a linear probe. Our findings revealed that a simple linear decoder struggled to distinguish between plausible and implausible endings in the lowest third of the model's layers. However, its accuracy sharply increased in the middle blocks, reaching a peak just before the top layers. Second, we examined the effective dimensionality of the encoded violation. Initially, the violation widens the representational subspace, followed by a collapse after a mid-stack bottleneck. This might indicate an exploratory phase that transitions into rapid consolidation. Taken together, these results contemplate the idea of alignment with classical psycholinguistic findings in human reading, where semantic anomalies are detected only after syntactic resolution, occurring later in the online processing sequence.
comment: Accepted at AICS2025
☆ RAVEN++: Pinpointing Fine-Grained Violations in Advertisement Videos with Active Reinforcement Reasoning EMNLP 2025
Advertising (Ad) is a cornerstone of the digital economy, yet the moderation of video advertisements remains a significant challenge due to their complexity and the need for precise violation localization. While recent advancements, such as the RAVEN model, have improved coarse-grained violation detection, critical gaps persist in fine-grained understanding, explainability, and generalization. To address these limitations, we propose RAVEN++, a novel framework that introduces three key innovations: 1) Active Reinforcement Learning (RL), which dynamically adapts training to samples of varying difficulty; 2) Fine-Grained Violation Understanding, achieved through hierarchical reward functions and reasoning distillation; and 3) Progressive Multi-Stage Training, which systematically combines knowledge injection, curriculum-based passive RL, and active RL. Extensive experiments on both public and proprietary datasets, on both offline scenarios and online deployed A/B Testing, demonstrate that RAVEN++ outperforms general-purpose LLMs and specialized models like RAVEN in terms of fine-grained violation understanding, reasoning capabilities, and generalization ability.
comment: EMNLP 2025 (Oral, Industry Track)
☆ Representational Stability of Truth in Large Language Models
Large language models (LLMs) are widely used for factual tasks such as "What treats asthma?" or "What is the capital of Latvia?". However, it remains unclear how stably LLMs encode distinctions between true, false, and neither-true-nor-false content in their internal probabilistic representations. We introduce representational stability as the robustness of an LLM's veracity representations to perturbations in the operational definition of truth. We assess representational stability by (i) training a linear probe on an LLM's activations to separate true from not-true statements and (ii) measuring how its learned decision boundary shifts under controlled label changes. Using activations from sixteen open-source models and three factual domains, we compare two types of neither statements. The first are fact-like assertions about entities we believe to be absent from any training data. We call these unfamiliar neither statements. The second are nonfactual claims drawn from well-known fictional contexts. We call these familiar neither statements. The unfamiliar statements induce the largest boundary shifts, producing up to $40\%$ flipped truth judgements in fragile domains (such as word definitions), while familiar fictional statements remain more coherently clustered and yield smaller changes ($\leq 8.2\%$). These results suggest that representational stability stems more from epistemic familiarity than from linguistic form. More broadly, our approach provides a diagnostic for auditing and training LLMs to preserve coherent truth assignments under semantic uncertainty, rather than optimizing for output accuracy alone.
comment: 25 pages, 24 figures
☆ From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation
This paper introduces the retrieval-augmented framework for automatic fashion caption and hashtag generation, combining multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically interesting text for fashion imagery, overcoming the limitations of end-to-end captioners that have problems with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module for fabric and gender attribute inference based on a structured product index. These attributes, together with retrieved style examples, create a factual evidence pack that is used to guide an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model is used as a supervised baseline model for comparison. Experimental results show that the YOLO detector is able to obtain a mean Average Precision (mAP@0.5) of 0.71 for nine categories of garments. The RAG-LLM pipeline generates expressive attribute-aligned captions and achieves mean attribute coverage of 0.80 with full coverage at the 50% threshold in hashtag generation, whereas BLIP gives higher lexical overlap and lower generalization. The retrieval-augmented approach exhibits better factual grounding, less hallucination, and great potential for scalable deployment in various clothing domains. These results demonstrate the use of retrieval-augmented generation as an effective and interpretable paradigm for automated and visually grounded fashion content generation.
comment: Submitted to Expert Systems with Applications
☆ Eliciting Chain-of-Thought in Base LLMs via Gradient-Based Representation Optimization AAAI2026
Chain-of-Thought (CoT) reasoning is a critical capability for large language models (LLMs), enabling them to tackle com- plex multi-step tasks. While base LLMs, pre-trained on general text corpora, often struggle with reasoning due to a lack of specialized training, recent studies reveal their latent reason- ing potential tied to hidden states. However, existing hidden state manipulation methods, such as linear activation steering, suffer from limitations due to their rigid and unconstrained nature, often leading to distribution shifts and degraded text quality. In this work, we propose a novel approach for elic- iting CoT reasoning from base LLMs through hidden state manipulation grounded in probabilistic conditional generation. By reformulating the challenge as an optimization problem with a balanced likelihood and prior regularization framework, our method guides hidden states toward reasoning-oriented trajectories while preserving linguistic coherence. Extensive evaluations across mathematical, commonsense, and logical reasoning benchmarks demonstrate that our approach con- sistently outperforms existing steering methods, offering a theoretically principled and effective solution for enhancing reasoning capabilities in base LLMs.
comment: AAAI2026
☆ Emotion-Enhanced Multi-Task Learning with LLMs for Aspect Category Sentiment Analysis
Aspect category sentiment analysis (ACSA) has achieved remarkable progress with large language models (LLMs), yet existing approaches primarily emphasize sentiment polarity while overlooking the underlying emotional dimensions that shape sentiment expressions. This limitation hinders the model's ability to capture fine-grained affective signals toward specific aspect categories. To address this limitation, we introduce a novel emotion-enhanced multi-task ACSA framework that jointly learns sentiment polarity and category-specific emotions grounded in Ekman's six basic emotions. Leveraging the generative capabilities of LLMs, our approach enables the model to produce emotional descriptions for each aspect category, thereby enriching sentiment representations with affective expressions. Furthermore, to ensure the accuracy and consistency of the generated emotions, we introduce an emotion refinement mechanism based on the Valence-Arousal-Dominance (VAD) dimensional framework. Specifically, emotions predicted by the LLM are projected onto a VAD space, and those inconsistent with their corresponding VAD coordinates are re-annotated using a structured LLM-based refinement strategy. Experimental results demonstrate that our approach significantly outperforms strong baselines on all benchmark datasets. This underlines the effectiveness of integrating affective dimensions into ACSA.
comment: 8 pages, 4 figures
☆ On the Optimality of Discrete Object Naming: a Kinship Case Study
The structure of naming systems in natural languages hinges on a trade-off between high informativeness and low complexity. Prior work capitalizes on information theory to formalize these notions; however, these studies generally rely on two simplifications: (i) optimal listeners, and (ii) universal communicative need across languages. Here, we address these limitations by introducing an information-theoretic framework for discrete object naming systems, and we use it to prove that an optimal trade-off is achievable if and only if the listener's decoder is equivalent to the Bayesian decoder of the speaker. Adopting a referential game setup from emergent communication, and focusing on the semantic domain of kinship, we show that our notion of optimality is not only theoretically achievable but also emerges empirically in learned communication systems.
☆ A symbolic Perl algorithm for the unification of Nahuatl word spellings
In this paper, we describe a symbolic model for the automatic orthographic unification of Nawatl text documents. Our model is based on algorithms that we have previously used to analyze sentences in Nawatl, and on the corpus called $π$-yalli, consisting of texts in several Nawatl orthographies. Our automatic unification algorithm implements linguistic rules in symbolic regular expressions. We also present a manual evaluation protocol that we have proposed and implemented to assess the quality of the unified sentences generated by our algorithm, by testing in a sentence semantic task. We have obtained encouraging results from the evaluators for most of the desired features of our artificially unified sentences
comment: MICAI 2025, LNAI 16221, pp. 141-154, 2026. 10 pages, 4 Figures, 8 Tables
☆ A Multi-Agent LLM Framework for Multi-Domain Low-Resource In-Context NER via Knowledge Retrieval, Disambiguation and Reflective Analysis AAAI 2026
In-context learning (ICL) with large language models (LLMs) has emerged as a promising paradigm for named entity recognition (NER) in low-resource scenarios. However, existing ICL-based NER methods suffer from three key limitations: (1) reliance on dynamic retrieval of annotated examples, which is problematic when annotated data is scarce; (2) limited generalization to unseen domains due to the LLM's insufficient internal domain knowledge; and (3) failure to incorporate external knowledge or resolve entity ambiguities. To address these challenges, we propose KDR-Agent, a novel multi-agent framework for multi-domain low-resource in-context NER that integrates Knowledge retrieval, Disambiguation, and Reflective analysis. KDR-Agent leverages natural-language type definitions and a static set of entity-level contrastive demonstrations to reduce dependency on large annotated corpora. A central planner coordinates specialized agents to (i) retrieve factual knowledge from Wikipedia for domain-specific mentions, (ii) resolve ambiguous entities via contextualized reasoning, and (iii) reflect on and correct model predictions through structured self-assessment. Experiments across ten datasets from five domains demonstrate that KDR-Agent significantly outperforms existing zero-shot and few-shot ICL baselines across multiple LLM backbones. The code and data can be found at https://github.com/MWXGOD/KDR-Agent.
comment: This paper has been accepted by AAAI 2026 (Main Technical Track)
☆ GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, including multi-step reasoning such as mathematical proving. However, existing approaches often lack an explicit and dynamic mechanism to structurally represent and evolve intermediate reasoning states, which limits their ability to perform context-aware theorem selection and iterative conclusion generation. To address these challenges, we propose GraphMind, a novel dynamic graph-based framework that integrates the graph neural network (GNN) with LLMs to iteratively select theorems and generate intermediate conclusions for multi-step reasoning. Our method models the reasoning process as a heterogeneous evolving graph, where nodes represent conditions, theorems, and conclusions, while edges capture logical dependencies between nodes. By encoding the current reasoning state with GNN and leveraging semantic matching for theorem selection, our framework enables context-aware, interpretable, and structured reasoning in a closed-loop manner. Experiments on various question-answering (QA) datasets demonstrate that our proposed GraphMind method achieves consistent performance improvements and significantly outperforms existing baselines in multi-step reasoning, validating the effectiveness and generalizability of our approach.
☆ Logic of Montage
In expressing emotions, as an expression form separate from natural language, we propose an alternative form that complements natural language, acting as a proxy or window for emotional states. First, we set up an expression form "Effect of Contradictory Structure." "Effect of Contradictory Structure" is not static but dynamic. Effect in "Effect of Contradictory Structure" is unpleasant or pleasant, and the orientation to avoid that unpleasantness is considered pseudo-expression of will. Second, "Effect of Contradictory Structure" can be overlapped with each other. This overlapping operation is called "montage." A broader "Structure" that includes related "Effect of Contradictory Structure" and "Effect of Structure" are set up. Montage produces "Effect of Structure". In montage, it is necessary to set something like "strength," so we adopted Deleuze and Deleuze/Guattari's word "intensity" and set it as an element of our model. We set up a general theoretical framework - Word Import Between Systems (Models) and justified the import of "intensity" through Austin's use of the word "force." "Effect of Structure" process is demonstrated using the example of proceeding to the next level of education.
☆ Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
Large language models demonstrate powerful capabilities across various natural language processing tasks, yet they also harbor safety vulnerabilities. To enhance LLM safety, various jailbreak defense methods have been proposed to guard against harmful outputs. However, improvements in model safety often come at the cost of severe over-refusal, failing to strike a good balance between safety and usability. In this paper, we first analyze the causes of over-refusal from a representation perspective, revealing that over-refusal samples reside at the boundary between benign and malicious samples. Based on this, we propose MOSR, designed to mitigate over-refusal by intervening the safety representation of LLMs. MOSR incorporates two novel components: (1) Overlap-Aware Loss Weighting, which determines the erasure weight for malicious samples by quantifying their similarity to pseudo-malicious samples in the representation space, and (2) Context-Aware Augmentation, which supplements the necessary context for rejection decisions by adding harmful prefixes before rejection responses. Experiments demonstrate that our method outperforms existing approaches in mitigating over-refusal while largely maintaining safety. Overall, we advocate that future defense methods should strike a better balance between safety and over-refusal.
☆ Classification EM-PCA for clustering and embedding
The mixture model is undoubtedly one of the greatest contributions to clustering. For continuous data, Gaussian models are often used and the Expectation-Maximization (EM) algorithm is particularly suitable for estimating parameters from which clustering is inferred. If these models are particularly popular in various domains including image clustering, they however suffer from the dimensionality and also from the slowness of convergence of the EM algorithm. However, the Classification EM (CEM) algorithm, a classifying version, offers a fast convergence solution while dimensionality reduction still remains a challenge. Thus we propose in this paper an algorithm combining simultaneously and non-sequentially the two tasks --Data embedding and Clustering-- relying on Principal Component Analysis (PCA) and CEM. We demonstrate the interest of such approach in terms of clustering and data embedding. We also establish different connections with other clustering approaches.
comment: Accepted at the IEEE conference on Big Data (Special Session on Machine Learning)
☆ Knowledge-based Graphical Method for Safety Signal Detection in Clinical Trials
We present a graphical, knowledge-based method for reviewing treatment-emergent adverse events (AEs) in clinical trials. The approach enhances MedDRA by adding a hidden medical knowledge layer (Safeterm) that captures semantic relationships between terms in a 2-D map. Using this layer, AE Preferred Terms can be regrouped automatically into similarity clusters, and their association to the trial disease may be quantified. The Safeterm map is available online and connected to aggregated AE incidence tables from ClinicalTrials.gov. For signal detection, we compute treatment-specific disproportionality metrics using shrinkage incidence ratios. Cluster-level EBGM values are then derived through precision-weighted aggregation. Two visual outputs support interpretation: a semantic map showing AE incidence and an expectedness-versus-disproportionality plot for rapid signal detection. Applied to three legacy trials, the automated method clearly recovers all expected safety signals. Overall, augmenting MedDRA with a medical knowledge layer improves clarity, efficiency, and accuracy in AE interpretation for clinical trials.
comment: 13 pages, 3 tables, 5 figures
☆ SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
Large Language Models (LLMs) face a significant bottleneck during autoregressive inference due to the massive memory footprint of the Key-Value (KV) cache. Existing compression techniques like token eviction, quantization, or other low-rank methods often risk information loss, have fixed limits, or introduce significant computational overhead from explicit decompression steps. In this work, we introduce SWAN, a novel, fine-tuning-free framework that eliminates this overhead. Our method uses an offline orthogonal matrix to rotate and prune the KV-cache, which is then used directly in the attention computation without any reconstruction. Our extensive experiments demonstrate that SWAN, augmented with a small dense buffer, offers a robust trade-off, maintaining performance close to the uncompressed baseline even at aggressive 50-60% memory savings per-token on KV-cache. A key advantage is its runtime-tunable compression level, allowing operators to dynamically adjust the memory footprint, a flexibility absent in methods requiring fixed offline configurations. This combination of a decompression-free design, high performance under compression, and adaptability makes SWAN a practical and efficient solution for serving LLMs with long contexts.
☆ Skeletons Matter: Dynamic Data Augmentation for Text-to-Query EMNLP 2025
The task of translating natural language questions into query languages has long been a central focus in semantic parsing. Recent advancements in Large Language Models (LLMs) have significantly accelerated progress in this field. However, existing studies typically focus on a single query language, resulting in methods with limited generalizability across different languages. In this paper, we formally define the Text-to-Query task paradigm, unifying semantic parsing tasks across various query languages. We identify query skeletons as a shared optimization target of Text-to-Query tasks, and propose a general dynamic data augmentation framework that explicitly diagnoses model-specific weaknesses in handling these skeletons to synthesize targeted training data. Experiments on four Text-to-Query benchmarks demonstrate that our method achieves state-of-the-art performance using only a small amount of synthesized data, highlighting the efficiency and generality of our approach and laying a solid foundation for unified research on Text-to-Query tasks. We release our code at https://github.com/jjjycaptain/Skeletron.
comment: Accepted at EMNLP 2025
☆ Look It Up: Analysing Internal Web Search Capabilities of Modern LLMs
Modern large language models integrate web search to provide real-time answers, yet it remains unclear whether they are efficiently calibrated to use search when it is actually needed. We introduce a benchmark evaluating both the necessity and effectiveness of web access across commercial models with no access to internal states or parameters. The dataset includes a static split of 783 temporally anchored questions answerable from pre-cutoff knowledge, aimed at testing whether models invoke search based on low internal confidence, and a dynamic split of 288 post-cutoff queries designed to test whether models recognise when search is required and retrieve updated information. Web access substantially improves static accuracy for GPT-5-mini and Claude Haiku 4.5, though confidence calibration worsens. On dynamic queries, both models frequently invoke search yet remain below 70 percent accuracy due to weak query formulation. Costs per accuracy-improving call remain low, but returns diminish once initial retrieval fails. Selective invocation helps, but models become overconfident and inconsistent after search. Overall, built-in web search meaningfully improves factual accuracy and can be invoked selectively, yet models remain overconfident, skip retrieval when it is essential, and falter once initial search queries underperform. Taken together, internal web search works better as a good low-latency verification layer than a reliable analytical tool, with clear room for improvement.
comment: 10 pages, 8 figures
☆ How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
☆ Reproducibility Study of Large Language Model Bayesian Optimization ICLR 2024
In this reproducibility study, we revisit the LLAMBO framework of Daxberger et al. (2024), a prompting-based Bayesian optimization (BO) method that uses large language models as discriminative surrogates and acquisition optimizers via text-only interactions. We replicate the core Bayesmark and HPOBench experiments under the original evaluation protocol, but replace GPT-3.5 with the open-weight Llama 3.1 70B model used for all text encoding components. Our results broadly confirm the main claims of LLAMBO. Contextual warm starting via textual problem and hyperparameter descriptions substantially improves early regret behaviour and reduces variance across runs. LLAMBO's discriminative surrogate is weaker than GP or SMAC as a pure single task regressor, yet benefits from cross task semantic priors induced by the language model. Ablations that remove textual context markedly degrade predictive accuracy and calibration, while the LLAMBO candidate sampler consistently generates higher quality and more diverse proposals than TPE or random sampling. Experiments with smaller backbones (Gemma 27B, Llama 3.1 8B) yield unstable or invalid predictions, suggesting insufficient capacity for reliable surrogate behaviour. Overall, our study shows that the LLAMBO architecture is robust to changing the language model backbone and remains effective when instantiated with Llama 3.1 70B.
comment: 7 pages, 8 figures. Reproducibility study of the LLAMBO framework (ICLR 2024). Code: https://github.com/spagnoloG/llambo-reproducibility
☆ CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation ACL'25
Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose \textbf{CoreEval}, a \textbf{Co}ntamination-\textbf{re}silient \textbf{Eval}uation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is employed to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.
comment: ACL'25
☆ Think Before You Prune: Selective Self-Generated Calibration for Pruning Large Reasoning Models
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning benchmarks. However, their long chain-of-thought reasoning processes incur significant inference overhead. Pruning has emerged as a promising approach to reducing computational costs. However, existing efforts have primarily focused on large language models (LLMs), while pruning LRMs remains unexplored. In this work, we conduct the first empirical study on pruning LRMs and show that directly applying existing pruning techniques fails to yield satisfactory results. Our findings indicate that using self-generated reasoning data for calibration can substantially improve pruning performance. We further investigate how the difficulty and length of reasoning data affect pruning outcomes. Our analysis reveals that challenging and moderately long self-generated reasoning data serve as ideal calibration data. Based on these insights, we propose a Selective Self-Generated Reasoning (SSGR) data construction strategy to provide effective calibration data for pruning LRMs. Experimental results on the DeepSeek-R1-Distill model series validate that our strategy improves the reasoning ability of pruned LRMs by 10%-13% compared to general pruning methods.
comment: Under Review
☆ Generating Reading Comprehension Exercises with Large Language Models for Educational Applications
With the rapid development of large language models (LLMs), the applications of LLMs have grown substantially. In the education domain, LLMs demonstrate significant potential, particularly in automatic text generation, which enables the creation of intelligent and adaptive learning content. This paper proposes a new LLMs framework, which is named as Reading Comprehension Exercise Generation (RCEG). It can generate high-quality and personalized English reading comprehension exercises automatically. Firstly, RCEG uses fine-tuned LLMs to generate content candidates. Then, it uses a discriminator to select the best candidate. Finally, the quality of the generated content has been improved greatly. To evaluate the performance of RCEG, a dedicated dataset for English reading comprehension is constructed to perform the experiments, and comprehensive evaluation metrics are used to analyze the experimental results. These metrics include content diversity, factual accuracy, linguistic toxicity, and pedagogical alignment. Experimental results show that RCEG significantly improves the relevance and cognitive appropriateness of the generated exercises.
☆ FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models
Content moderation filters are a critical safeguard against alignment failures in language models. Yet most existing filters focus narrowly on general safety and overlook cultural context. In this work, we introduce FanarGuard, a bilingual moderation filter that evaluates both safety and cultural alignment in Arabic and English. We construct a dataset of over 468K prompt and response pairs, drawn from synthetic and public datasets, scored by a panel of LLM judges on harmlessness and cultural awareness, and use it to train two filter variants. To rigorously evaluate cultural alignment, we further develop the first benchmark targeting Arabic cultural contexts, comprising over 1k norm-sensitive prompts with LLM-generated responses annotated by human raters. Results show that FanarGuard achieves stronger agreement with human annotations than inter-annotator reliability, while matching the performance of state-of-the-art filters on safety benchmarks. These findings highlight the importance of integrating cultural awareness into moderation and establish FanarGuard as a practical step toward more context-sensitive safeguards.
☆ Cognitive Alpha Mining via LLM-Driven Code-Based Evolution
Discovering effective predictive signals, or ``alphas,'' from financial data with high dimensionality and extremely low signal-to-noise ratio remains a difficult open problem. Despite progress in deep learning, genetic programming, and, more recently, large language model (LLM)--based factor generation, existing approaches still explore only a narrow region of the vast alpha search space. Neural models tend to produce opaque and fragile patterns, while symbolic or formula-based methods often yield redundant or economically ungrounded expressions that generalize poorly. Although different in form, these paradigms share a key limitation: none can conduct broad, structured, and human-like exploration that balances logical consistency with creative leaps. To address this gap, we introduce the Cognitive Alpha Mining Framework (CogAlpha), which combines code-level alpha representation with LLM-driven reasoning and evolutionary search. Treating LLMs as adaptive cognitive agents, our framework iteratively refines, mutates, and recombines alpha candidates through multi-stage prompts and financial feedback. This synergistic design enables deeper thinking, richer structural diversity, and economically interpretable alpha discovery, while greatly expanding the effective search space. Experiments on A-share equities demonstrate that CogAlpha consistently discovers alphas with superior predictive accuracy, robustness, and generalization over existing methods. Our results highlight the promise of aligning evolutionary optimization with LLM-based reasoning for automated and explainable alpha discovery. All source code will be released.
☆ Large Language Models for the Summarization of Czech Documents: From History to the Present
Text summarization is the task of automatically condensing longer texts into shorter, coherent summaries while preserving the original meaning and key information. Although this task has been extensively studied in English and other high-resource languages, Czech summarization, particularly in the context of historical documents, remains underexplored. This is largely due to the inherent linguistic complexity of Czech and the lack of high-quality annotated datasets. In this work, we address this gap by leveraging the capabilities of Large Language Models (LLMs), specifically Mistral and mT5, which have demonstrated strong performance across a wide range of natural language processing tasks and multilingual settings. In addition, we also propose a translation-based approach that first translates Czech texts into English, summarizes them using an English-language model, and then translates the summaries back into Czech. Our study makes the following main contributions: We demonstrate that LLMs achieve new state-of-the-art results on the SumeCzech dataset, a benchmark for modern Czech text summarization, showing the effectiveness of multilingual LLMs even for morphologically rich, medium-resource languages like Czech. We introduce a new dataset, Posel od Čerchova, designed for the summarization of historical Czech texts. This dataset is derived from digitized 19th-century publications and annotated for abstractive summarization. We provide initial baselines using modern LLMs to facilitate further research in this underrepresented area. By combining cutting-edge models with both modern and historical Czech datasets, our work lays the foundation for further progress in Czech summarization and contributes valuable resources for future research in Czech historical document processing and low-resource summarization more broadly.
☆ A Reproducible Framework for Neural Topic Modeling in Focus Group Analysis
Focus group discussions generate rich qualitative data but their analysis traditionally relies on labor-intensive manual coding that limits scalability and reproducibility. We present a rigorous, reproducible computational framework for applying neural topic modeling to focus group transcripts, addressing fundamental methodological challenges: hyperparameter sensitivity, model stability, and validation of interpretability. Using BERTopic applied to ten focus groups exploring HPV vaccine perceptions in Tunisia (1,076 utterances), we conducted systematic evaluation across 27 hyperparameter configurations, assessed stability through bootstrap resampling with 30 replicates per configuration, and validated interpretability through formal human evaluation by three domain experts. Our analysis demonstrates substantial sensitivity to hyperparameter choices and reveals that metric selection for stability assessment must align with analytical goals. A hierarchical merging strategy (extracting fine-grained topics for stability then consolidating for interpretability) effectively navigates the stability-coherence tradeoff, achieving coherence of 0.558 compared to 0.539 for direct extraction. Human validation confirmed topic quality with very good inter-rater reliability (ICC = 0.79, weighted Cohen's kappa = 0.578). Our framework provides practical guidelines that researchers can adapt to their own qualitative research contexts. All code, data processing scripts, and evaluation protocols are publicly available to support reproduction and extension of this work.
☆ Concept than Document: Context Compression via AMR-based Conceptual Entropy
Large Language Models (LLMs) face information overload when handling long contexts, particularly in Retrieval-Augmented Generation (RAG) where extensive supporting documents often introduce redundant content. This issue not only weakens reasoning accuracy but also increases computational overhead. We propose an unsupervised context compression framework that exploits Abstract Meaning Representation (AMR) graphs to preserve semantically essential information while filtering out irrelevant text. By quantifying node-level entropy within AMR graphs, our method estimates the conceptual importance of each node, enabling the retention of core semantics. Specifically, we construct AMR graphs from raw contexts, compute the conceptual entropy of each node, and screen significant informative nodes to form a condensed and semantically focused context than raw documents. Experiments on the PopQA and EntityQuestions datasets show that our method outperforms vanilla and other baselines, achieving higher accuracy while substantially reducing context length. To the best of our knowledge, this is the first work introducing AMR-based conceptual entropy for context compression, demonstrating the potential of stable linguistic features in context engineering.
☆ Assessing the alignment between infants' visual and linguistic experience using multimodal language models
Figuring out which objects or concepts words refer to is a central language learning challenge for young children. Most models of this process posit that children learn early object labels from co-occurrences of words and their referents that occur when someone around them talks about an object in the immediate physical environment. But how aligned in time are children's visual and linguistic experiences during everyday learning? To date, answers to this question have been limited by the need for labor-intensive manual annotations of vision-language co-occurrences. Here, we evaluate the use of contrastive language-image pretraining (CLIP) models to automatically characterize vision-language alignment in egocentric videos taken from the infant perspective in home environments. After validating CLIP alignment scores using human alignment judgments, we apply this metric to a large corpus of infant-perspective videos. We show that idealized aligned moments for learning (e.g., "look at the ball" with a ball present in the child's view) are relatively rare in children's everyday experiences compared to modern machine learning datasets, and highlight variability in alignment both within and across children. These findings suggest that infrequent alignment is a constraint for models describing early word learning and offer a new method for investigating children's multimodal environment.
☆ HyperbolicRAG: Enhancing Retrieval-Augmented Generation with Hyperbolic Representations
Retrieval-augmented generation (RAG) enables large language models (LLMs) to access external knowledge, helping mitigate hallucinations and enhance domain-specific expertise. Graph-based RAG enhances structural reasoning by introducing explicit relational organization that enables information propagation across semantically connected text units. However, these methods typically rely on Euclidean embeddings that capture semantic similarity but lack a geometric notion of hierarchical depth, limiting their ability to represent abstraction relationships inherent in complex knowledge graphs. To capture both fine-grained semantics and global hierarchy, we propose HyperbolicRAG, a retrieval framework that integrates hyperbolic geometry into graph-based RAG. HyperbolicRAG introduces three key designs: (1) a depth-aware representation learner that embeds nodes within a shared Poincare manifold to align semantic similarity with hierarchical containment, (2) an unsupervised contrastive regularization that enforces geometric consistency across abstraction levels, and (3) a mutual-ranking fusion mechanism that jointly exploits retrieval signals from Euclidean and hyperbolic spaces, emphasizing cross-space agreement during inference. Extensive experiments across multiple QA benchmarks demonstrate that HyperbolicRAG outperforms competitive baselines, including both standard RAG and graph-augmented baselines.
comment: 12 pages
☆ Context-Aware Whisper for Arabic ASR Under Linguistic Varieties
Low-resource ASR remains a challenging problem, especially for languages like Arabic that exhibit wide dialectal variation and limited labeled data. We propose context-aware prompting strategies to adapt OpenAI's Whisper for Arabic speech recognition without retraining. Our methods include decoder prompting with first-pass transcriptions or retrieved utterances, and encoder prefixing using speech synthesized in the target speaker's voice. We introduce techniques such as prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval (lexical, semantic, acoustic) to improve transcription in real-world, zero-shot settings. Evaluated on nine Arabic linguistic conditions, our approach reduces WER by up to 22.3% on Modern Standard Arabic and 9.2% on dialectal speech, significantly mitigating hallucinations and speaker mismatch.
☆ Robust Multimodal Sentiment Analysis with Distribution-Based Feature Recovery and Fusion ACM MM 2024
As posts on social media increase rapidly, analyzing the sentiments embedded in image-text pairs has become a popular research topic in recent years. Although existing works achieve impressive accomplishments in simultaneously harnessing image and text information, they lack the considerations of possible low-quality and missing modalities. In real-world applications, these issues might frequently occur, leading to urgent needs for models capable of predicting sentiment robustly. Therefore, we propose a Distribution-based feature Recovery and Fusion (DRF) method for robust multimodal sentiment analysis of image-text pairs. Specifically, we maintain a feature queue for each modality to approximate their feature distributions, through which we can simultaneously handle low-quality and missing modalities in a unified framework. For low-quality modalities, we reduce their contributions to the fusion by quantitatively estimating modality qualities based on the distributions. For missing modalities, we build inter-modal mapping relationships supervised by samples and distributions, thereby recovering the missing modalities from available ones. In experiments, two disruption strategies that corrupt and discard some modalities in samples are adopted to mimic the low-quality and missing modalities in various real-world scenarios. Through comprehensive experiments on three publicly available image-text datasets, we demonstrate the universal improvements of DRF compared to SOTA methods under both two strategies, validating its effectiveness in robust multimodal sentiment analysis.
comment: Accepted by ACM MM 2024
☆ Large Language Models Require Curated Context for Reliable Political Fact-Checking -- Even with Reasoning and Web Search
Large language models (LLMs) have raised hopes for automated end-to-end fact-checking, but prior studies report mixed results. As mainstream chatbots increasingly ship with reasoning capabilities and web search tools -- and millions of users already rely on them for verification -- rigorous evaluation is urgent. We evaluate 15 recent LLMs from OpenAI, Google, Meta, and DeepSeek on more than 6,000 claims fact-checked by PolitiFact, comparing standard models with reasoning- and web-search variants. Standard models perform poorly, reasoning offers minimal benefits, and web search provides only moderate gains, despite fact-checks being available on the web. In contrast, a curated RAG system using PolitiFact summaries improved macro F1 by 233% on average across model variants. These findings suggest that giving models access to curated high-quality context is a promising path for automated fact-checking.
☆ RhinoInsight: Improving Deep Research through Control Mechanisms for Model Behavior and Context
Large language models are evolving from single-turn responders into tool-using agents capable of sustained reasoning and decision-making for deep research. Prevailing systems adopt a linear pipeline of plan to search to write to a report, which suffers from error accumulation and context rot due to the lack of explicit control over both model behavior and context. We introduce RhinoInsight, a deep research framework that adds two control mechanisms to enhance robustness, traceability, and overall quality without parameter updates. First, a Verifiable Checklist module transforms user requirements into traceable and verifiable sub-goals, incorporates human or LLM critics for refinement, and compiles a hierarchical outline to anchor subsequent actions and prevent non-executable planning. Second, an Evidence Audit module structures search content, iteratively updates the outline, and prunes noisy context, while a critic ranks and binds high-quality evidence to drafted content to ensure verifiability and reduce hallucinations. Our experiments demonstrate that RhinoInsight achieves state-of-the-art performance on deep research tasks while remaining competitive on deep search tasks.
☆ Empathetic Cascading Networks: A Multi-Stage Prompting Technique for Reducing Social Biases in Large Language Models
This report presents the Empathetic Cascading Networks (ECN) framework, a multi-stage prompting method designed to enhance the empathetic and inclusive capabilities of large language models. ECN employs four stages: Perspective Adoption, Emotional Resonance, Reflective Understanding, and Integrative Synthesis, to guide models toward generating emotionally resonant and contextually aware responses. Experimental results demonstrate that ECN achieves the highest Empathy Quotient (EQ) scores across GPT-3.5-turbo and GPT-4, while maintaining competitive Regard and Perplexity metrics. These findings emphasize ECN's potential for applications requiring empathy and inclusivity in conversational AI.
☆ CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework using QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.
♻ ☆ MiniF2F in Rocq: Automatic Translation Between Proof Assistants -- A Case Study
In this work, we conduct an experiment using state-of-the-art LLMs to translate MiniF2F into Rocq. The translation task focuses on generating a Rocq theorem based on three sources: a natural language description, the Lean formalization, and the Isabelle formalization. We conducted our experiment in 3 stages of increasing complexity, from basic one-shot prompting to multi-turn conversations that incorporate feedback from unsuccessful attempts. At each stage, we perform multiple rounds of translation using increasingly advanced models: GPT-4o mini, Claude 3.5 Sonnet, o1 mini, and o1. We successfully translated 478 out of 488 theorems. The dataset is opensource: https://github.com/LLM4Rocq/miniF2F-rocq.
♻ ☆ Information Extraction From Fiscal Documents Using LLMs
Large Language Models (LLMs) have demonstrated remarkable capabilities in text comprehension, but their ability to process complex, hierarchical tabular data remains underexplored. We present a novel approach to extracting structured data from multi-page government fiscal documents using LLM-based techniques. Applied to annual fiscal documents from the State of Karnataka in India (200+ pages), our method achieves high accuracy through a multi-stage pipeline that leverages domain knowledge, sequential context, and algorithmic validation. A large challenge with traditional OCR methods is the inability to verify the accurate extraction of numbers. When applied to fiscal data, the inherent structure of fiscal tables, with totals at each level of the hierarchy, allows for robust internal validation of the extracted data. We use these hierarchical relationships to create multi-level validation checks. We demonstrate that LLMs can read tables and also process document-specific structural hierarchies, offering a scalable process for converting PDF-based fiscal disclosures into research-ready databases. Our implementation shows promise for broader applications across developing country contexts.
comment: 6 pages. Presented at the AI for Financial Inclusion, Risk Modeling and Resilience in Emerging Markets workshop at ACM ICAIF 2025 Singapore
♻ ☆ PEANuT: Parameter-Efficient Adaptation with Weight-aware Neural Tweakers
Fine-tuning large pre-trained foundation models often yields excellent downstream performance but is prohibitively expensive when updating all parameters. Parameter-efficient fine-tuning (PEFT) methods such as LoRA alleviate this by introducing lightweight update modules, yet they commonly rely on weight-agnostic linear approximations, limiting their expressiveness. In this work, we propose PEANuT, a novel PEFT framework that introduces weight-aware neural tweakers, compact neural modules that generate task-adaptive updates conditioned on frozen pre-trained weights. PEANuT provides a flexible yet efficient way to capture complex update patterns without full model tuning. We theoretically show that PEANuT achieves equivalent or greater expressivity than existing linear PEFT methods with comparable or fewer parameters. Extensive experiments across four benchmarks with over twenty datasets demonstrate that PEANuT consistently outperforms strong baselines in both NLP and vision tasks, while maintaining low computational overhead.
♻ ☆ Sentence Smith: Controllable Edits for Evaluating Text Embeddings EMNLP 2025
Controllable and transparent text generation has been a long-standing goal in NLP. Almost as long-standing is a general idea for addressing this challenge: Parsing text to a symbolic representation, and generating from it. However, earlier approaches were hindered by parsing and generation insufficiencies. Using modern parsers and a safety supervision mechanism, we show how close current methods come to this goal. Concretely, we propose the Sentence Smith framework for English, which has three steps: 1. Parsing a sentence into a semantic graph. 2. Applying human-designed semantic manipulation rules. 3. Generating text from the manipulated graph. A final entailment check (4.) verifies the validity of the applied transformation. To demonstrate our framework's utility, we use it to induce hard negative text pairs that challenge text embedding models. Since the controllable generation makes it possible to clearly isolate different types of semantic shifts, we can evaluate text embedding models in a fine-grained way, also addressing an issue in current benchmarking where linguistic phenomena remain opaque. Human validation confirms that our transparent generation process produces texts of good quality. Notably, our way of generation is very resource-efficient, since it relies only on smaller neural networks.
comment: EMNLP 2025 (main), this version fixes a subscript typo in Eq 1
♻ ☆ Enhancing Domain-Specific Encoder Models with LLM-Generated Data: How to Leverage Ontologies, and How to Do Without Them EMNLP 2025
We investigate the use of LLM-generated data for continual pretraining of encoder models in specialized domains with limited training data, using the scientific domain of invasion biology as a case study. To this end, we leverage domain-specific ontologies by enriching them with LLM-generated data and pretraining the encoder model as an ontology-informed embedding model for concept definitions. To evaluate the effectiveness of this method, we compile a benchmark specifically designed for assessing model performance in invasion biology. After demonstrating substantial improvements over standard LLM pretraining, we investigate the feasibility of applying the proposed approach to domains without comprehensive ontologies by substituting ontological concepts with concepts automatically extracted from a small corpus of scientific abstracts and establishing relationships between concepts through distributional statistics. Our results demonstrate that this automated approach achieves comparable performance using only a small set of scientific abstracts, resulting in a fully automated pipeline for enhancing domain-specific understanding of small encoder models that is especially suited for application in low-resource settings and achieves performance comparable to masked language modeling pretraining on much larger datasets.
comment: Published in the Findings of the Association for Computational Linguistics: EMNLP 2025
♻ ☆ How does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective AAAI 2026
Multilingual Alignment is an effective and representative paradigm to enhance LLMs' multilingual capabilities, which transfers the capabilities from the high-resource languages to the low-resource languages. Meanwhile, some research on language-specific neurons provides a new perspective to analyze and understand LLMs' mechanisms. However, we find that there are many neurons that are shared by multiple but not all languages and cannot be correctly classified. In this work, we propose a ternary classification methodology that categorizes neurons into three types, including language-specific neurons, language-related neurons, and general neurons. And we propose a corresponding identification algorithm to distinguish these different types of neurons. Furthermore, based on the distributional characteristics of different types of neurons, we divide the LLMs' internal process for multilingual inference into four parts: (1) multilingual understanding, (2) shared semantic space reasoning, (3) multilingual output space transformation, and (4) vocabulary space outputting. Additionally, we systematically analyze the models before and after alignment with a focus on different types of neurons. We also analyze the phenomenon of ''Spontaneous Multilingual Alignment''. Overall, our work conducts a comprehensive investigation based on different types of neurons, providing empirical results and valuable insights to better understand multilingual alignment and multilingual capabilities of LLMs.
comment: AAAI 2026 (Oral)
♻ ☆ ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation AACL 2025
Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.
comment: Accepted at AACL 2025 (Main Conference Paper)
♻ ☆ Strategic Innovation Management in the Age of Large Language Models Market Intelligence, Adaptive R&D, and Ethical Governance
This study analyzes the multiple functions of Large Language Models (LLMs) in transforming research and development (R&D) processes. By automating knowledge discovery, boosting hypothesis creation, integrating transdisciplinary insights, and enabling cooperation within innovation ecosystems, LLMs dramatically improve the efficiency and effectiveness of research processes. Through extensive analysis of scientific literature, patent databases, and experimental data, these models enable more flexible and informed R&D workflows, ultimately accelerating innovation cycles and lowering time-to-market for breakthrough ideas.
♻ ☆ Word-level Annotation of GDPR Transparency Compliance in Privacy Policies using Large Language Models
Ensuring transparency of data practices related to personal information is a core requirement of the General Data Protection Regulation (GDPR). However, large-scale compliance assessment remains challenging due to the complexity and diversity of privacy policy language. Manual audits are labour-intensive and inconsistent, while current automated methods often lack the granularity required to capture nuanced transparency disclosures. In this paper, we present a modular large language model (LLM)-based pipeline for fine-grained word-level annotation of privacy policies with respect to GDPR transparency requirements. Our approach integrates LLM-driven annotation with passage-level classification, retrieval-augmented generation, and a self-correction mechanism to deliver scalable, context-aware annotations across 21 GDPR-derived transparency requirements. To support empirical evaluation, we compile a corpus of 703,791 English-language privacy policies and generate a ground-truth sample of 200 manually annotated policies based on a comprehensive, GDPR-aligned annotation scheme. We propose a two-tiered evaluation methodology capturing both passage-level classification and span-level annotation quality and conduct a comparative analysis of seven state-of-the-art LLMs on two annotation schemes, including the widely used OPP-115 dataset. The results of our evaluation show that decomposing the annotation task and integrating targeted retrieval and classification components significantly improve annotation accuracy, particularly for well-structured requirements. Our work provides new empirical resources and methodological foundations for advancing automated transparency compliance assessment at scale.
comment: Accepted to Proceedings on Privacy Enhancing Technologies (PoPETs) 1 (2026)
♻ ☆ A Survey of Generative Categories and Techniques in Multimodal Generative Models
Multimodal Generative Models (MGMs) have rapidly evolved beyond text generation, now spanning diverse output modalities including images, music, video, human motion, and 3D objects, by integrating language with other sensory modalities under unified architectures. This survey categorises six primary generative modalities and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities. We analyze key models, architectural trends, and emergent cross-modal synergies, while highlighting transferable techniques and unresolved challenges. Building on a common taxonomy of models and training recipes, we propose a unified evaluation framework centred on faithfulness, compositionality, and robustness, and synthesise evidence from benchmarks and human studies across modalities. We further analyse trustworthiness, safety, and ethical risks, including multimodal bias, privacy leakage, and the misuse of high-fidelity media generation for deepfakes, disinformation, and copyright infringement in music and 3D assets, together with emerging mitigation strategies. Finally, we discuss how architectural trends, evaluation protocols, and governance mechanisms can be co-designed to close current capability and safety gaps, outlining critical paths toward more general-purpose, controllable, and accountable multimodal generative systems.
♻ ☆ Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?
Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gödel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that LIVE-SWE-AGENT can achieve an impressive solve rate of 77.4% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
♻ ☆ Lost in translation: using global fact-checks to measure multilingual misinformation prevalence, spread, and evolution
Misinformation and disinformation are growing threats in the digital age, affecting people across languages and borders. However, no research has investigated the prevalence of multilingual misinformation and quantified the extent to which misinformation diffuses across languages. This paper investigates the prevalence and dynamics of multilingual misinformation through an analysis of 264,487 fact-checks spanning 95 languages. To study the evolution of claims over time and mutations across languages, we represent fact-checks with multilingual sentence embeddings and build a graph where semantically similar claims are linked. We provide quantitative evidence of repeated fact-checking efforts and establish that claims diffuse across languages. Specifically, we find that while the majority of misinformation claims are only fact-checked once, 10.26%, corresponding to more than 27,000 claims, are checked multiple times. Using fact-checks as a proxy for the spread of misinformation, we find 32.26% of repeated claims cross linguistic boundaries, suggesting that some misinformation permeates language barriers. However, spreading patterns exhibit strong assortativity, with misinformation more likely to spread within the same language or language family. Next we show that fact-checkers take more time to fact-check claims that have crossed language barriers and model the temporal and cross-lingual evolution of claims. We analyze connected components and shortest paths connecting different versions of a claim finding that claims gradually drift over time and undergo greater alteration when traversing languages. Misinformation changes over time, reducing the effectiveness of static claim matching algorithms. The findings advocate for expanded information sharing between fact-checkers globally while underscoring the importance of localized verification.
♻ ☆ In-Situ Tweedie Discrete Diffusion Models
While diffusion models excel at generating continuous data such as images, adapting them to discrete tasks has relied on indirect approaches that either operate in continuous embedding spaces or use token masking mechanisms, both of which deviate from modeling the true discrete data distribution that can be theoretically guaranteed by Tweedie's formula. We propose in-situ Tweedie Discrete Diffusion (TDD), a framework that performs diffusion guaranteed by Tweedie's formula directly within the discrete one-hot space, hence "in-situ." Unlike prior methods that diffuse continuous embeddings or mask tokens, TDD directly corrupts one-hot vectors with Gaussian noise and performs iterative denoising through a timestep-conditioned cross-entropy objective rather than mean-squared-error reconstruction. At each denoising step, the model predicts class probabilities, applies argmax to obtain discrete predictions, converts them to one-hot vectors, and feeds them into the next iteration with progressively reduced noise. This process naturally unifies discriminative classification and generative modeling under a single framework. Experiments demonstrate that TDD achieves strong performance on both image classification and text generation tasks, with extensive ablation studies confirming the effectiveness of each design component. Our work establishes a principled approach to discrete diffusion that preserves the core characteristics of diffusion models while operating natively in discrete space.
♻ ☆ AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
Recent studies have shown that large language models (LLMs), especially smaller ones, often lack robustness in grade school math (GSM) reasoning. In particular, they tend to experience performance drops when faced with distribution shifts, such as changes to numerical or nominal variables, or insertions of distracting clauses. A possible strategy to address this involves generating synthetic data to further "instantiate" reasoning problems on potential variations. In this work, we instead focuses on the strategy of "abstracting" reasoning problems. This not only helps counteract distribution shifts but also facilitates the connection to symbolic tools for deriving solutions. Focusing on GSM, we find that this abstraction process is better acquired through reinforcement learning (RL) than just supervised fine-tuning, which often fails to produce faithful abstractions. Our method, AbstRaL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks. Besides, improving GSM robustness via AbstRaL is shown to also implicitly benefit LLMs' capabilities on OOD mathematical and general reasoning tasks, indicating that abstract thinking broadly enables better generalizability.
comment: Under review
♻ ☆ URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training NeurIPS 2025
Large Language Models (LLMs) are commonly pretrained on vast corpora of text without utilizing contextual metadata such as source, quality, or topic, leading to a context-free learning paradigm. While recent studies suggest that adding metadata like URL information as context (i.e., auxiliary inputs not used in the loss calculation) can improve training efficiency and downstream performance, they offer limited understanding of which types of metadata are truly effective and under what conditions. In this work, we conduct a systematic evaluation and find that not all metadata types contribute equally. Only URL context speeds up training, whereas quality scores and topic/format domain information offer no clear benefit. Furthermore, the improved downstream performances of URL conditioning emerge only when longer prompts are used at inference time. In addition, we demonstrate that context-aware pretraining enables more controllable generation than context-free pretraining, in a classifier-free guidance fashion. Although topic and format metadata do not accelerate training, they are effective for steering outputs, offering human-interpretable control over generation.
comment: NeurIPS 2025, Camera Ready
♻ ☆ ModernBERT is More Efficient than Conventional BERT for Chest CT Findings Classification in Japanese Radiology Reports
Japanese language models for medical text classification face challenges with complex vocabulary and linguistic structures in radiology reports. This study compared three Japanese models--BERT Base, JMedRoBERTa, and ModernBERT--for multi-label classification of 18 chest CT findings. Using the CT-RATE-JPN dataset, all models were fine-tuned under identical conditions. ModernBERT showed clear efficiency advantages, producing substantially fewer tokens and achieving faster training and inference than the other models while maintaining comparable performance on the internal test dataset (exact match accuracy: 74.7% vs. 72.7% for BERT Base). To assess generalizability, we additionally constructed RR-Findings, an external dataset of 243 naturally written Japanese radiology reports annotated using the same schema. Under this domain-shifted setting, performance differences became pronounced: BERT Base outperformed both JMedRoBERTa and ModernBERT, whereas ModernBERT showed the largest decline in exact match accuracy. Average precision differences were smaller, indicating that ModernBERT retained reasonable ranking ability despite reduced calibration. Overall, ModernBERT offers substantial computational efficiency and strong in-domain performance but remains sensitive to real-world linguistic variability. These results highlight the need for more diverse natural-language training data and domain-specific calibration strategies to improve robustness when deploying modern transformer models in heterogeneous clinical environments.
comment: 31 pages
♻ ☆ Entropy-Guided Reasoning Compression
Large reasoning models have demonstrated remarkable performance on complex reasoning tasks, yet the excessive length of their chain-of-thought outputs remains a major practical bottleneck due to high computation cost and poor deployability. Existing compression methods have achieved partial success but overlook a crucial phenomenon in the training process -- the entropy conflict. During compression training, entropy decreases, leading to shorter reasoning but limited exploration, while accuracy-oriented objectives increase entropy, lengthening reasoning chains. This can cause the model to get stuck in a local dilemma. Our analysis further reveals the origin of the entropy conflict: many high-entropy tokens are logical connectors that receive larger gradients and are encouraged under the performance objective, while the compression objective simultaneously penalizes these potentially redundant connectors. This opposing pressure creates a direct source of entropy conflict. To address these issues, we adopt an entropy-guided training framework. As entropy descends, the model is guided toward efficient reasoning by encouraging concise thought steps; as entropy rises, exploration is reinforced under the compact reasoning mode to improve robustness. Experiments on six mathematical benchmarks show that our method compresses reasoning length to 20% of the original while maintaining or even surpassing baseline accuracy. Code and models will be released publicly.
comment: 10pages, 4 figures
♻ ☆ Agent-OM: Leveraging LLM Agents for Ontology Matching
Ontology matching (OM) enables semantic interoperability between different ontologies and resolves their conceptual heterogeneity by aligning related entities. OM systems currently have two prevailing design paradigms: conventional knowledge-based expert systems and newer machine learning-based predictive systems. While large language models (LLMs) and LLM agents have revolutionised data engineering and have been applied creatively in many domains, their potential for OM remains underexplored. This study introduces a novel agent-powered LLM-based design paradigm for OM systems. With consideration of several specific challenges in leveraging LLM agents for OM, we propose a generic framework, namely Agent-OM (Agent for Ontology Matching), consisting of two Siamese agents for retrieval and matching, with a set of OM tools. Our framework is implemented in a proof-of-concept system. Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks over state-of-the-art OM systems show that our system can achieve results very close to the long-standing best performance on simple OM tasks and can significantly improve the performance on complex and few-shot OM tasks.
comment: 31 pages
♻ ☆ TRIM: Token Reduction and Inference Modeling for Cost-Effective Language Generation
The high inference cost of Large Language Models (LLMs) poses challenges, especially for tasks requiring lengthy outputs. However, natural language often contains redundancy, which presents an opportunity for optimization. We have observed that LLMs can generate distilled language (i.e., concise outputs that retain essential meaning) when prompted appropriately. We propose TRIM, a pipeline for saving computational cost in which the LLM omits a predefined set of semantically irrelevant and easily inferable words based on the context during inference. Then, a specifically trained smaller language model with lower inference cost reconstructs the distilled answer into the ideal answer. Our experiments show promising results, particularly on the proposed NaLDA evaluation dataset focused on the reconstruction task, with 19.4% saved tokens on average for GPT-4o and only a tiny decrease in evaluation metrics. This suggests that the approach can effectively balance efficiency and accuracy in language processing tasks.
comment: 16 pages, 9 tables, 5 figures
♻ ☆ Safeguarding Privacy of Retrieval Data against Membership Inference Attacks: Is This Query Too Close to Home? EMNLP
Retrieval-augmented generation (RAG) mitigates the hallucination problem in large language models (LLMs) and has proven effective for personalized usages. However, delivering private retrieved documents directly to LLMs introduces vulnerability to membership inference attacks (MIAs), which try to determine whether the target data point exists in the private external database or not. Based on the insight that MIA queries typically exhibit high similarity to only one target document, we introduce a novel similarity-based MIA detection framework designed for the RAG system. With the proposed method, we show that a simple detect-and-hide strategy can successfully obfuscate attackers, maintain data utility, and remain system-agnostic against MIA. We experimentally prove its detection and defense against various state-of-the-art MIA methods and its adaptability to existing RAG systems.
comment: Accepted for EMNLP findings 2025
♻ ☆ Evaluation of OpenAI o1: Opportunities and Challenges of AGI
This comprehensive study evaluates the performance of OpenAI's o1-preview large language model across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include: -83.3% success rate in solving complex competitive programming problems, surpassing many human experts. -Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models. -100% accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions. -Advanced natural language inference capabilities across general and specialized domains like medicine. -Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis. -Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields. -Strong capabilities in quantitative investing. O1 has comprehensive financial knowledge and statistical modeling skills. -Effective performance in social media analysis, including sentiment analysis and emotion recognition. The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.
♻ ☆ Beyond SELECT: A Comprehensive Taxonomy-Guided Benchmark for Real-World Text-to-SQL Translation
Text-to-SQL datasets are essential for training and evaluating text-to-SQL models, but existing datasets often suffer from limited coverage and fail to capture the diversity of real-world applications. To address this, we propose a novel taxonomy for text-to-SQL classification based on dimensions including core intents, statement types, syntax structures, and key actions. Using this taxonomy, we evaluate widely used public text-to-SQL datasets (e.g., Spider and Bird) and reveal limitations in their coverage and diversity. We then introduce a taxonomy-guided dataset synthesis pipeline, yielding a new dataset named SQL-Synth. This approach combines the taxonomy with Large Language Models (LLMs) to ensure the dataset reflects the breadth and complexity of real-world text-to-SQL applications. Extensive analysis and experimental results validate the effectiveness of our taxonomy, as SQL-Synth exhibits greater diversity and coverage compared to existing benchmarks. Moreover, we uncover that existing LLMs typically fall short in adequately capturing the full range of scenarios, resulting in limited performance on SQL-Synth. However, fine-tuning can substantially improve their performance in these scenarios. The proposed taxonomy has significant potential impact, as it not only enables comprehensive analysis of datasets and the performance of different LLMs, but also guides the construction of training data for LLMs.
♻ ☆ SGM: A Framework for Building Specification-Guided Moderation Filters
Aligning large language models (LLMs) with deployment-specific requirements is critical but inherently imperfect. Despite extensive training, models remain susceptible to misalignment and adversarial inputs such as jailbreaks. Content moderation filters are commonly used as external safeguards, though they typically focus narrowly on safety. We introduce SGM (Specification-Guided Moderation), a flexible framework for training moderation filters grounded in user-defined specifications that go beyond standard safety concerns. SGM automates training data generation without relying on human-written examples, enabling scalable support for diverse, application-specific alignment goals. SGM-trained filters perform on par with state-of-the-art safety filters built on curated datasets, while supporting fine-grained and user-defined alignment control.
♻ ☆ DataSage: Multi-agent Collaboration for Insight Discovery with External Knowledge Retrieval, Multi-role Debating, and Multi-path Reasoning
In today's data-driven era, fully automated end-to-end data analytics, particularly insight discovery, is critical for discovering actionable insights that assist organizations in making effective decisions. With the rapid advancement of large language models (LLMs), LLM-driven agents have emerged as a promising paradigm for automating data analysis and insight discovery. However, existing data insight agents remain limited in several key aspects, often failing to deliver satisfactory results due to: (1) insufficient utilization of domain knowledge, (2) shallow analytical depth, and (3) error-prone code generation during insight generation. To address these issues, we propose DataSage, a novel multi-agent framework that incorporates three innovative features including external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen analytical depth, and multi-path reasoning to improve the accuracy of the generated code and insights. Extensive experiments on InsightBench demonstrate that DataSage consistently outperforms existing data insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.
♻ ☆ Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models EMNLP 2025
Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the \textbf{Premise Critique Ability} for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the \textbf{Premise Critique Bench (PCBench)}, designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs. Our findings reveal: (1) Most models rely heavily on explicit prompts to detect errors, with limited autonomous critique; (2) Premise critique ability depends on question difficulty and error type, with direct contradictions being easier to detect than complex or procedural errors; (3) Reasoning ability does not consistently correlate with the premise critique ability; (4) Flawed premises trigger overthinking in reasoning models, markedly lengthening responses due to repeated attempts at resolving conflicts. These insights underscore the urgent need to enhance LLMs' proactive evaluation of input validity, positioning premise critique as a foundational capability for developing reliable, human-centric systems. The code is available at https://github.com/MLGroupJLU/Premise_Critique.
comment: EMNLP 2025 Findings camera-ready version
♻ ☆ Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL
Generating accurate SQL from users' natural language questions (text-to-SQL) remains a long-standing challenge due to the complexities involved in user question understanding, database schema comprehension, and SQL generation. Traditional text-to-SQL systems, which combine human engineering and deep neural networks, have made significant progress. Subsequently, pre-trained language models (PLMs) have been developed for text-to-SQL tasks, achieving promising results. However, as modern databases and user questions grow more complex, PLMs with a limited parameter size often produce incorrect SQL. This necessitates more sophisticated and tailored optimization methods, which restricts the application of PLM-based systems. Recently, large language models (LLMs) have shown significant capabilities in natural language understanding as model scale increases. Thus, integrating LLM-based solutions can bring unique opportunities, improvements, and solutions to text-to-SQL research. In this survey, we provide a comprehensive review of existing LLM-based text-to-SQL studies. Specifically, we offer a brief overview of the technical challenges and evolutionary process of text-to-SQL. Next, we introduce the datasets and metrics designed to evaluate text-to-SQL systems. Subsequently, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we make a summarization and discuss the remaining challenges in this field and suggest expectations for future research directions. All the related resources of LLM-based, including research papers, benchmarks, and open-source projects, are collected for the community in our repository: https://github.com/DEEP-PolyU/Awesome-LLM-based-Text2SQL.
comment: Accepted to IEEE TKDE2025
♻ ☆ Systematic Reward Gap Optimization for Mitigating VLM Hallucinations NeurIPS 2025
The success of Direct Preference Optimization (DPO) in mitigating hallucinations in Vision Language Models (VLMs) critically hinges on the true reward gaps within preference pairs. However, current methods, typically relying on ranking or rewriting strategies, often struggle to optimize these reward gaps in a systematic way during data curation. A core difficulty lies in precisely characterizing and strategically manipulating the overall reward gap configuration, that is, the deliberate design of how to shape these reward gaps within each preference pair across the data. To address this, we introduce Topic-level Preference Rewriting(TPR), a novel framework designed for the systematic optimization of reward gap configuration. Through selectively replacing semantic topics within VLM responses with model's own resampled candidates for targeted rewriting, TPR can provide topic-level control over fine-grained semantic details. This precise control enables advanced data curation strategies, such as progressively adjusting the difficulty of rejected responses, thereby sculpting an effective reward gap configuration that guides the model to overcome challenging hallucinations. Comprehensive experiments demonstrate TPR achieves state-of-the-art performance on multiple hallucination benchmarks, outperforming previous methods by an average of 20%. Notably, it significantly reduces hallucinations by up to 93% on ObjectHal-Bench, and also exhibits superior data efficiency towards robust and cost-effective VLM alignment. Code and datasets are available at https://tpr-dpo.github.io .
comment: 34 pages, 12 figures, Accepted by NeurIPS 2025
♻ ☆ IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the best ASRs compared with other baselines on almost all settings without compromising clean accuracy, maintaining robustness against existing defenses, and exhibiting transferability across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding.
comment: 20 pages, 13 Figures
♻ ☆ BiasJailbreak:Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models AAAI 2026
Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as `jailbreaks', where malicious inputs can coerce LLMs into generating harmful content bypassing safety alignments. In this paper, we delve into the ethical biases in LLMs and examine how those biases could be exploited for jailbreaks. Notably, these biases result in a jailbreaking success rate in GPT-4o models that differs by 20\% between non-binary and cisgender keywords and by 16\% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of BiasJailbreak, highlighting the inherent risks posed by these safety-induced biases. BiasJailbreak generates biased keywords automatically by asking the target LLM itself, and utilizes the keywords to generate harmful output. Additionally, we propose an efficient defense method BiasDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. BiasDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize that ethical biases in LLMs can actually lead to generating unsafe output, and suggest a method to make the LLMs more secure and unbiased. To enable further research and improvements, we open-source our code and artifacts of BiasJailbreak, providing the community with tools to better understand and mitigate safety-induced biases in LLMs.
comment: Accepted as a workshop paper at AAAI 2026
♻ ☆ Can Code-Switched Texts Activate a Knowledge Switch in LLMs? A Case Study on English-Korean Code-Switching EMNLP 2025
Recent large language models (LLMs) demonstrate multilingual abilities, yet they are English-centric due to dominance of English in training corpora. The limited resource for low-resource languages remains a crucial challenge. Code-switching (CS), a phenomenon where multilingual speakers alternate between languages in a discourse, can convey subtle cultural and linguistic nuances that can be otherwise lost in translation and elicits language-specific knowledge in human communications. In light of this, we investigate whether code-switching can activate, or identify and leverage knowledge for reasoning when LLMs solve low-resource language tasks. To facilitate the research, we first present EnKoQA, a synthetic English-Korean CS question-answering dataset. We provide comprehensive analysis on a variety of multilingual LLMs by subdividing activation process into knowledge identification and knowledge leveraging. Our results demonstrate that compared to English text, CS can faithfully activate knowledge inside LLMs especially on language-specific domains, suggesting the potential of code-switching on low-resource language tasks.
comment: Accepted to EMNLP 2025 Findings
♻ ☆ SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning
Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall efficiency. In this work, we propose SlimInfer, an innovative framework that aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass. Our key insight is an information diffusion phenomenon: As information from critical tokens propagates through layers, it becomes distributed across the entire sequence. This diffusion process suggests that LLMs can maintain their semantic integrity when excessive tokens, even including these critical ones, are pruned in hidden states. Motivated by this, SlimInfer introduces a dynamic fine-grained pruning mechanism that accurately removes redundant tokens of hidden state at intermediate layers. This layer-wise pruning naturally enables an asynchronous KV cache manager that prefetches required token blocks without complex predictors, reducing both memory usage and I/O costs. Extensive experiments show that SlimInfer can achieve up to $\mathbf{2.53\times}$ time-to-first-token (TTFT) speedup and $\mathbf{1.88\times}$ end-to-end latency reduction for LLaMA3.1-8B-Instruct on a single RTX 4090, without sacrificing performance on LongBench. Our code is available at https://github.com/Longxmas/SlimInfer.
♻ ☆ REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning
Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover using only supervised fine-tune achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset-comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).
♻ ☆ Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.
comment: 31 pages, 27 figures
♻ ☆ SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth AAAI 2026
The rapid proliferation of large language models (LLMs) in applications targeting children and adolescents necessitates a fundamental reassessment of prevailing AI safety frameworks, which are largely tailored to adult users and neglect the distinct developmental vulnerabilities of minors. This paper highlights key deficiencies in existing LLM safety benchmarks, including their inadequate coverage of age-specific cognitive, emotional, and social risks spanning early childhood (ages 0--6), middle childhood (7--12), and adolescence (13--18). To bridge these gaps, we introduce SproutBench, an innovative evaluation suite comprising 1,283 developmentally grounded adversarial prompts designed to probe risks such as emotional dependency, privacy violations, and imitation of hazardous behaviors. Through rigorous empirical evaluation of 47 diverse LLMs, we uncover substantial safety vulnerabilities, corroborated by robust inter-dimensional correlations (e.g., between Safety and Risk Prevention) and a notable inverse relationship between Interactivity and Age Appropriateness. These insights yield practical guidelines for advancing child-centric AI design and deployment.
comment: Accepted in AAAI 2026 Workshop on AI for Education
♻ ☆ SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage ACL
Large language models (LLMs) have made significant advancements across various tasks, but their safety alignment remain a major concern. Exploring jailbreak prompts can expose LLMs' vulnerabilities and guide efforts to secure them. Existing methods primarily design sophisticated instructions for the LLM to follow, or rely on multiple iterations, which could hinder the performance and efficiency of jailbreaks. In this work, we propose a novel jailbreak paradigm, Simple Assistive Task Linkage (SATA), which can effectively circumvent LLM safeguards and elicit harmful responses. Specifically, SATA first masks harmful keywords within a malicious query to generate a relatively benign query containing one or multiple [MASK] special tokens. It then employs a simple assistive task such as a masked language model task or an element lookup by position task to encode the semantics of the masked keywords. Finally, SATA links the assistive task with the masked query to jointly perform the jailbreak. Extensive experiments show that SATA achieves state-of-the-art performance and outperforms baselines by a large margin. Specifically, on AdvBench dataset, with mask language model (MLM) assistive task, SATA achieves an overall attack success rate (ASR) of 85% and harmful score (HS) of 4.57, and with element lookup by position (ELP) assistive task, SATA attains an overall ASR of 76% and HS of 4.43.
comment: ACL Findings 2025. Welcome to employ SATA as a baseline
♻ ☆ Can Large Language Models Detect Misinformation in Scientific News Reporting?
Scientific facts are often spun in the popular press with the intent to influence public opinion and action, as was evidenced during the COVID-19 pandemic. Automatic detection of misinformation in the scientific domain is challenging because of the distinct styles of writing in these two media types and is still in its nascence. Most research on the validity of scientific reporting treats this problem as a claim verification challenge. In doing so, significant expert human effort is required to generate appropriate claims. Our solution bypasses this step and addresses a more real-world scenario where such explicit, labeled claims may not be available. The central research question of this paper is whether it is possible to use large language models (LLMs) to detect misinformation in scientific reporting. To this end, we first present a new labeled dataset SciNews, containing 2.4k scientific news stories drawn from trusted and untrustworthy sources, paired with related abstracts from the CORD-19 database. Our dataset includes both human-written and LLM-generated news articles, making it more comprehensive in terms of capturing the growing trend of using LLMs to generate popular press articles. Then, we identify dimensions of scientific validity in science news articles and explore how this can be integrated into the automated detection of scientific misinformation. We propose several baseline architectures using LLMs to automatically detect false representations of scientific findings in the popular press. For each of these architectures, we use several prompt engineering strategies including zero-shot, few-shot, and chain-of-thought prompting. We also test these architectures and prompting strategies on GPT-3.5, GPT-4, and Llama2-7B, Llama2-13B.
♻ ☆ GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration
The sparse Mixture-of-Experts (MoE) architecture of large language models (LLMs) confronts an inherent issue of load imbalance arising from the simplistic linear router strategy, which ultimately causes the instability and inefficient learning of LLMs. To address this challenge, we introduce a novel MoE graph-based framework $\textbf{GMoE}$, aimed at enhancing the collaboration among multiple experts. In GMoE, a graph router function is designed to capture the collaboration signals among experts. This enables all experts to dynamically allocate information derived from input data by sharing information with their neighboring experts. Moreover, we put forward two coordination strategies in GMoE: the $\textit{Poisson distribution-based distinction strategy}$ and the $\textit{Normal distribution-based balance strategy}$, to further release the capacity of each expert and increase the model stability in the fine-tuning of LLMs. Specifically, we leverage a parameter-efficient fine-tuning technique, i.e., Low-Rank Adaptation (LoRA), to implement the graph MoE architecture. Extensive experiments on four real-world benchmark datasets demonstrate the effectiveness of GMoE, showing the benefits of facilitating collaborations of multiple experts in LLM fine-tuning. The code of experimental implementation is available at https://github.com/BAI-LAB/GMoE
comment: 9 pages, 25 figures
♻ ☆ Ellipsoid-Based Decision Boundaries for Open Intent Classification
Textual open intent classification is crucial for real-world dialogue systems, enabling robust detection of unknown user intents without prior knowledge and contributing to the robustness of the system. While adaptive decision boundary methods have shown great potential by eliminating manual threshold tuning, existing approaches assume isotropic distributions of known classes, restricting boundaries to balls and overlooking distributional variance along different directions. To address this limitation, we propose EliDecide, a novel method that learns ellipsoid decision boundaries with varying scales along different feature directions. First, we employ supervised contrastive learning to obtain a discriminative feature space for known samples. Second, we apply learnable matrices to parameterize ellipsoids as the boundaries of each known class, offering greater flexibility than spherical boundaries defined solely by centers and radii. Third, we optimize the boundaries via a novelly designed dual loss function that balances empirical and open-space risks: expanding boundaries to cover known samples while contracting them against synthesized pseudo-open samples. Our method achieves state-of-the-art performance on multiple text intent benchmarks and further on a question classification dataset. The flexibility of the ellipsoids demonstrates superior open intent detection capability and strong potential for generalization to more text classification tasks in diverse complex open-world scenarios.
♻ ☆ Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
Multiple-choice question answering (MCQA) has been a popular format for evaluating and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes the accuracy metrics unreliable for indicating real capabilities and encourages explicit or implicit answer guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions according to different answer types, apply different rewriting and verification schemes, respectively. When applied for RFT, we converted 20k MCQA examples and use GRPO to finetune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.
comment: Project url: https://flageval-baai.github.io/ReVeL/
♻ ☆ Personalized LLM Decoding via Contrasting Personal Preference EMNLP 2025
As large language models (LLMs) are progressively deployed in various real-world applications, personalization of LLMs has become increasingly important. While various approaches to LLM personalization such as prompt-based and training-based methods have been actively explored, the development of effective decoding-time algorithms remains largely overlooked, despite their demonstrated potential. In this paper, we propose CoPe (Contrasting Personal Preference), a novel decoding-time approach applied after performing parameter-efficient fine-tuning (PEFT) on user-specific data. Our core idea is to leverage reward-guided decoding specifically for personalization by maximizing each user's implicit reward signal. We evaluate CoPe across five open-ended personalized text generation tasks. Our empirical results demonstrate that CoPe achieves strong performance, improving personalization by an average of 10.57% in ROUGE-L, without relying on external reward models or additional training procedures.
comment: EMNLP 2025 Main
♻ ☆ Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning
Large Language Models (LLMs) equipped with modern Retrieval-Augmented Generation (RAG) systems often employ multi-turn interaction pipelines to interface with search engines for complex reasoning tasks. However, such multi-turn interactions inevitably produce long intermediate contexts, as context length grows exponentially with exploration depth. This leads to a well-known limitation of LLMs: their difficulty in effectively leveraging information from long contexts. This problem is further amplified in RAG systems that depend on in-context learning, where few-shot demonstrations must also be included in the prompt, compounding the context-length bottleneck. To address these challenges, we propose Mujica-MyGo, a unified framework for efficient multi-turn reasoning in RAG. Inspired by the divide-and-conquer principle, we introduce Mujica (Multi-hop Joint Intelligence for Complex Question Answering), a multi-agent RAG workflow that decomposes multi-turn interactions into cooperative sub-interactions, thereby mitigating long-context issues. To eliminate the dependency on in-context learning, we further develop MyGO (Minimalist Policy Gradient Optimization), a lightweight and efficient reinforcement learning algorithm that enables effective post-training of LLMs within complex RAG pipelines. We provide theoretical guarantees for MyGO's convergence to the optimal policy. Empirical evaluations across diverse question-answering benchmarks, covering both text corpora and knowledge graphs, show that Mujica-MyGO achieves superior performance.
Artificial Intelligence
☆ VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection
We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
☆ Mixture of Horizons in Action Chunking
Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying fixed choice of single horizons being suboptimal. To mitigate the trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments over flow-based policies $π_0$, $π_{0.5}$, and one-step regression policy $π_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulations and real-world tasks. Notably, under mixed-task setting, $π_{0.5}$ with MoH reaches a new state-of-the-art with 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons
comment: 15 pages, 14 figures
Prompt Less, Smile More: MTP with Semantic Engineering in Lieu of Prompt Engineering
AI-Integrated programming is emerging as a foundational paradigm for building intelligent systems with large language models (LLMs). Recent approaches such as Meaning Typed Programming (MTP) automate prompt generation by leveraging the semantics already present in code. However, many real-world applications depend on contextual cues, developer intent, and domain-specific reasoning that extend beyond what static code semantics alone can express. To address this limitation, we introduce Semantic Engineering, a lightweight method for enriching program semantics so that LLM-based systems can more accurately reflect developer intent without requiring full manual prompt design. We present Semantic Context Annotations (SemTexts), a language-level mechanism that allows developers to embed natural-language context directly into program constructs. Integrated into the Jac programming language, Semantic Engineering extends MTP to incorporate these enriched semantics during prompt generation. We further introduce a benchmark suite designed to reflect realistic AI-Integrated application scenarios. Our evaluation shows that Semantic Engineering substantially improves prompt fidelity, achieving performance comparable to Prompt Engineering while requiring significantly less developer effort.
☆ Beyond Protein Language Models: An Agentic LLM Framework for Mechanistic Enzyme Design
We present Genie-CAT, a tool-augmented large-language-model (LLM) system designed to accelerate scientific hypothesis generation in protein design. Using metalloproteins (e.g., ferredoxins) as a case study, Genie-CAT integrates four capabilities -- literature-grounded reasoning through retrieval-augmented generation (RAG), structural parsing of Protein Data Bank files, electrostatic potential calculations, and machine-learning prediction of redox properties -- into a unified agentic workflow. By coupling natural-language reasoning with data-driven and physics-based computation, the system generates mechanistically interpretable, testable hypotheses linking sequence, structure, and function. In proof-of-concept demonstrations, Genie-CAT autonomously identifies residue-level modifications near [Fe--S] clusters that affect redox tuning, reproducing expert-derived hypotheses in a fraction of the time. The framework highlights how AI agents combining language models with domain-specific tools can bridge symbolic reasoning and numerical simulation, transforming LLMs from conversational assistants into partners for computational discovery.
comment: 10 pages, 4 figures
☆ SLMFix: Leveraging Small Language Models for Error Fixing with Reinforcement Learning
Recent advancements in large language models (LLMs) have shown very impressive capabilities in code generation across many programming languages. However, even state-of-the-art LLMs generate programs that contains syntactic errors and fail to complete the given tasks, especially for low-resource programming languages (LRPLs). In addition, high training cost makes finetuning LLMs unaffordable with constrained computational resources, further undermining the effectiveness of LLMs for code generation. In this work, we propose SLMFix, a novel code generation pipeline that leverages a small language model (SLM) finetuned using reinforcement learning (RL) techniques to fix syntactic errors in LLM-generated programs to improve the quality of LLM-generated programs for domain-specific languages (DSLs). In specific, we applied RL on the SLM for the program repair task using a reward calculated using both a static validator and a static semantic similarity metric. Our experimental results demonstrate the effectiveness and generalizability of our approach across multiple DSLs, achieving more than 95% pass rate on the static validator. Notably, SLMFix brings substantial improvement to the base model and outperforms supervised finetuning approach even for 7B models on a LRPL, showing the potential of our approach as an alternative to traditional finetuning approaches.
☆ Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens-compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with COVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16% and demonstrates that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.
comment: Project page: https://wakalsprojectpage.github.io/comt-website/
☆ Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration
Large Language Models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge-intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision), often requires costly development of large-scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular, multi-agent framework for extending LLMs to multimodal reasoning by orchestrating collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners through conversations. We then introduce a data synthesis and supervised fine-tuning pipeline to train the perceiver agent to effectively collaborate with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks the multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open-source solution, i.e. equipping text-only DeepSeek-R1 with Qwen2.5-VL-7B perceiver, to outperform large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi-agent approach for building future multimodal reasoning systems.
☆ UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame
☆ In-Video Instructions: Visual Signals as Generative Control
Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.
☆ DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.
☆ Real-Time Object Tracking with On-Device Deep Learning for Adaptive Beamforming in Dynamic Acoustic Environments
Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localization and directional audio capture in dynamic environments. The approach combines single-camera depth estimation and stereo vision to enable accurate 3D localization of moving objects. A planar concentric circular microphone array constructed with MEMS microphones provides a compact, energy-efficient platform supporting 2D beam steering across azimuth and elevation. Real-time tracking outputs continuously adapt the array's focus, synchronizing the acoustic response with the target's position. By uniting learned spatial awareness with dynamic steering, the system maintains robust performance in the presence of multiple or moving sources. Experimental evaluation demonstrates significant gains in signal-to-interference ratio, making the design well-suited for teleconferencing, smart home devices, and assistive technologies.
☆ Predicting partially observable dynamical systems via diffusion models with a multiscale inference scheme
Conditional diffusion models provide a natural framework for probabilistic prediction of dynamical systems and have been successfully applied to fluid dynamics and weather prediction. However, in many settings, the available information at a given time represents only a small fraction of what is needed to predict future states, either due to measurement uncertainty or because only a small fraction of the state can be observed. This is true for example in solar physics, where we can observe the Sun's surface and atmosphere, but its evolution is driven by internal processes for which we lack direct measurements. In this paper, we tackle the probabilistic prediction of partially observable, long-memory dynamical systems, with applications to solar dynamics and the evolution of active regions. We show that standard inference schemes, such as autoregressive rollouts, fail to capture long-range dependencies in the data, largely because they do not integrate past information effectively. To overcome this, we propose a multiscale inference scheme for diffusion models, tailored to physical processes. Our method generates trajectories that are temporally fine-grained near the present and coarser as we move farther away, which enables capturing long-range temporal dependencies without increasing computational cost. When integrated into a diffusion model, we show that our inference scheme significantly reduces the bias of the predicted distributions and improves rollout stability.
☆ An Anatomy Aware Hybrid Deep Learning Framework for Lung Cancer Tumor Stage Classification
Accurate lung cancer tumor staging is crucial for prognosis and treatment planning. However, it remains challenging for end-to-end deep learning approaches, as such approaches often overlook spatial and anatomical information that are central to the tumor-node-metastasis system. The tumor stage depends on multiple quantitative criteria, including the tumor size and its proximity to the nearest anatomical structures, and small variations can alter the staging outcome. We propose a medically grounded hybrid pipeline that performs staging by explicitly measuring the tumor's size and distance properties rather than treating it as a pure image classification task. Our method employs specialized encoder-decoder networks to precisely segment the lung and adjacent anatomy, including the lobes, tumor, mediastinum, and diaphragm. Subsequently, we extract the necessary tumor properties, i.e. measure the largest tumor dimension and calculate the distance between the tumor and neighboring anatomical structures by a quantitative analysis of the segmentation masks. Finally, we apply rule-based tumor staging aligned with the medical guidelines. This novel framework has been evaluated on the Lung-PET-CT-Dx dataset, demonstrating superior performance compared to traditional deep learning models, achieving an overall classification accuracy of 91.36%. We report the per-stage F1-scores of 0.93 (T1), 0.89 (T2), 0.96 (T3), and 0.90 (T4), a critical evaluation aspect often omitted in prior literature. To our knowledge, this is the first study that embeds explicit clinical context into tumor stage classification. Unlike standard convolutional neural networks that operate in an uninterpretable "black box" manner, our method offers both state-of-the-art performance and transparent decision support.
☆ DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAE in the two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This thus frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Codes are publicly available at https://github.com/Zehong-Ma/DeCo.
comment: Project Page: https://zehong-ma.github.io/DeCo. Code Repository: https://github.com/Zehong-Ma/DeCo
☆ Leveraging LLMs for reward function design in reinforcement learning control tasks
The challenge of designing effective reward functions in reinforcement learning (RL) represents a significant bottleneck, often requiring extensive human expertise and being time-consuming. Previous work and recent advancements in large language models (LLMs) have demonstrated their potential for automating the generation of reward functions. However, existing methodologies often require preliminary evaluation metrics, human-engineered feedback for the refinement process, or the use of environmental source code as context. To address these limitations, this paper introduces LEARN-Opt (LLM-based Evaluator and Analyzer for Reward functioN Optimization). This LLM-based, fully autonomous, and model-agnostic framework eliminates the need for preliminary metrics and environmental source code as context to generate, execute, and evaluate reward function candidates from textual descriptions of systems and task objectives. LEARN-Opt's main contribution lies in its ability to autonomously derive performance metrics directly from the system description and the task objective, enabling unsupervised evaluation and selection of reward functions. Our experiments indicate that LEARN-Opt achieves performance comparable to or better to that of state-of-the-art methods, such as EUREKA, while requiring less prior knowledge. We find that automated reward design is a high-variance problem, where the average-case candidate fails, requiring a multi-run approach to find the best candidates. Finally, we show that LEARN-Opt can unlock the potential of low-cost LLMs to find high-performing candidates that are comparable to, or even better than, those of larger models. This demonstrated performance affirms its potential to generate high-quality reward functions without requiring any preliminary human-defined metrics, thereby reducing engineering overhead and enhancing generalizability.
☆ Explicit Tonal Tension Conditioning via Dual-Level Beam Search for Symbolic Music Generation
State-of-the-art symbolic music generation models have recently achieved remarkable output quality, yet explicit control over compositional features, such as tonal tension, remains challenging. We propose a novel approach that integrates a computational tonal tension model, based on tonal interval vector analysis, into a Transformer framework. Our method employs a two-level beam search strategy during inference. At the token level, generated candidates are re-ranked using model probability and diversity metrics to maintain overall quality. At the bar level, a tension-based re-ranking is applied to ensure that the generated music aligns with a desired tension curve. Objective evaluations indicate that our approach effectively modulates tonal tension, and subjective listening tests confirm that the system produces outputs that align with the target tension. These results demonstrate that explicit tension conditioning through a dual-level beam search provides a powerful and intuitive tool to guide AI-generated music. Furthermore, our experiments demonstrate that our method can generate multiple distinct musical interpretations under the same tension condition.
comment: 12 pages, 2 Figures, Accepted at the 17th International Symposium on Computer Music Multidisciplinary Research (CMMR) 2025
☆ Generative Query Expansion with Multilingual LLMs for Cross-Lingual Information Retrieval
Query expansion is the reformulation of a user query by adding semantically related information, and is an essential component of monolingual and cross-lingual information retrieval used to ensure that relevant documents are not missed. Recently, multilingual large language models (mLLMs) have shifted query expansion from semantic augmentation with synonyms and related words to pseudo-document generation. Pseudo-documents both introduce additional relevant terms and bridge the gap between short queries and long documents, which is particularly beneficial in dense retrieval. This study evaluates recent mLLMs and fine-tuned variants across several generative expansion strategies to identify factors that drive cross-lingual retrieval performance. Results show that query length largely determines which prompting technique is effective, and that more elaborate prompts often do not yield further gains. Substantial linguistic disparities persist: cross-lingual query expansion can produce the largest improvements for languages with the weakest baselines, yet retrieval is especially poor between languages written in different scripts. Fine-tuning is found to lead to performance gains only when the training and test data are of similar format. These outcomes underline the need for more balanced multilingual and cross-lingual training and evaluation resources.
☆ What Drives Cross-lingual Ranking? Retrieval Approaches with Multilingual Language Models
Cross-lingual information retrieval (CLIR) enables access to multilingual knowledge but remains challenging due to disparities in resources, scripts, and weak cross-lingual semantic alignment in embedding models. Existing pipelines often rely on translation and monolingual retrieval heuristics, which add computational overhead and noise, degrading performance. This work systematically evaluates four intervention types, namely document translation, multilingual dense retrieval with pretrained encoders, contrastive learning at word, phrase, and query-document levels, and cross-encoder re-ranking, across three benchmark datasets. We find that dense retrieval models trained specifically for CLIR consistently outperform lexical matching methods and derive little benefit from document translation. Contrastive learning mitigates language biases and yields substantial improvements for encoders with weak initial alignment, and re-ranking can be effective, but depends on the quality of the cross-encoder training data. Although high-resource languages still dominate overall performance, gains over lexical and document-translated baselines are most pronounced for low-resource and cross-script pairs. These findings indicate that cross-lingual search systems should prioritise semantic multilingual embeddings and targeted learning-based alignment over translation-based pipelines, particularly for cross-script and under-resourced languages.
☆ Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach
Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic styles, but also introduce copyright and security risks. Dataset watermarking has been proposed to ensure traceability by embedding imperceptible watermarks into training images, which remain detectable in outputs even after fine-tuning. However, current methods lack a unified evaluation framework. To address this, this paper establishes a general threat model and introduces a comprehensive evaluation framework encompassing Universality, Transmissibility, and Robustness. Experiments show that existing methods perform well in universality and transmissibility, and exhibit some robustness against common image processing operations, yet still fall short under real-world threat scenarios. To reveal these vulnerabilities, the paper further proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.
☆ PRInTS: Reward Modeling for Long-Horizon Information Seeking
Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. While process reward models (PRMs) can guide agents by ranking candidate steps at test-time, existing PRMs, designed for short reasoning with binary judgment, cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM's reasoning across multiple step quality dimensions (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. Extensive evaluations across FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks on multiple models, along with ablations, reveal that best-of-n sampling with PRInTS enhances information-seeking abilities of open-source models as well as specialized agents, matching or surpassing the performance of frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.
comment: 18 pages, code: https://github.com/G-JWLee/PRInTS
☆ AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning
Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decrease as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is avaiable at https://github.com/FoundationAgents/AutoEnv.
☆ Open-weight genome language model safeguards: Assessing robustness via adversarial fine-tuning NeurIPS 2025
Novel deep learning architectures are increasingly being applied to biological data, including genetic sequences. These models, referred to as genomic language mod- els (gLMs), have demonstrated impressive predictive and generative capabilities, raising concerns that such models may also enable misuse, for instance via the generation of genomes for human-infecting viruses. These concerns have catalyzed calls for risk mitigation measures. The de facto mitigation of choice is filtering of pretraining data (i.e., removing viral genomic sequences from training datasets) in order to limit gLM performance on virus-related tasks. However, it is not currently known how robust this approach is for securing open-source models that can be fine-tuned using sensitive pathogen data. Here, we evaluate a state-of-the-art gLM, Evo 2, and perform fine-tuning using sequences from 110 harmful human-infecting viruses to assess the rescue of misuse-relevant predictive capabilities. The fine- tuned model exhibited reduced perplexity on unseen viral sequences relative to 1) the pretrained model and 2) a version fine-tuned on bacteriophage sequences. The model fine-tuned on human-infecting viruses also identified immune escape variants from SARS-CoV-2 (achieving an AUROC of 0.6), despite having no expo- sure to SARS-CoV-2 sequences during fine-tuning. This work demonstrates that data exclusion might be circumvented by fine-tuning approaches that can, to some degree, rescue misuse-relevant capabilities of gLMs. We highlight the need for safety frameworks for gLMs and outline further work needed on evaluations and mitigation measures to enable the safe deployment of gLMs.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Biosecurity Safeguards for Generative AI
☆ Data Flows and Colonial Regimes in Africa: A Critical Analysis of the Colonial Futurities Embedded in AI Ecosystems
This chapter seeks to frame the elemental and invisible problems of AI and big data in the African context by examining digital sites and infrastructure through the lens of power and interests. It will present reflections on how these sites are using AI recommendation algorithms to recreate new digital societies in the region, how they have the potential to propagate algorithmic colonialism and negative gender norms, and what this means for the regional sustainable development agenda. The chapter proposes adopting business models that embrace response-ability and consider the existence of alternative socio-material worlds of AI. These reflections will mainly come from ongoing discussions with Kenyan social media users in this authors' user space talks, personal experiences and six months of active participant observations done by the authors.
comment: 12 pages
☆ Dynamic Multi-Species Bird Soundscape Generation with Acoustic Patterning and 3D Spatialization
Generation of dynamic, scalable multi-species bird soundscapes remains a significant challenge in computer music and algorithmic sound design. Birdsongs involve rapid frequency-modulated chirps, complex amplitude envelopes, distinctive acoustic patterns, overlapping calls, and dynamic inter-bird interactions, all of which require precise temporal and spatial control in 3D environments. Existing approaches, whether Digital Signal Processing (DSP)-based or data-driven, typically focus only on single species modeling, static call structures, or synthesis directly from recordings, and often suffer from noise, limited flexibility, or large data needs. To address these challenges, we present a novel, fully algorithm-driven framework that generates dynamic multi-species bird soundscapes using DSP-based chirp generation and 3D spatialization, without relying on recordings or training data. Our approach simulates multiple independently-moving birds per species along different moving 3D trajectories, supporting controllable chirp sequences, overlapping choruses, and realistic 3D motion in scalable soundscapes while preserving species-specific acoustic patterns. A visualization interface provides bird trajectories, spectrograms, activity timelines, and sound waves for analytical and creative purposes. Both visual and audio evaluations demonstrate the ability of the system to generate dense, immersive, and ecologically inspired soundscapes, highlighting its potential for computer music, interactive virtual environments, and computational bioacoustics research.
comment: Accepted by IEEE Big Data 2025
☆ Interpreting GFlowNets for Drug Discovery: Extracting Actionable Insights for Medicinal Chemistry NeurIPS 2025
Generative Flow Networks, or GFlowNets, offer a promising framework for molecular design, but their internal decision policies remain opaque. This limits adoption in drug discovery, where chemists require clear and interpretable rationales for proposed structures. We present an interpretability framework for SynFlowNet, a GFlowNet trained on documented chemical reactions and purchasable starting materials that generates both molecules and the synthetic routes that produce them. Our approach integrates three complementary components. Gradient based saliency combined with counterfactual perturbations identifies which atomic environments influence reward and how structural edits change molecular outcomes. Sparse autoencoders reveal axis aligned latent factors that correspond to physicochemical properties such as polarity, lipophilicity, and molecular size. Motif probes show that functional groups including aromatic rings and halogens are explicitly encoded and linearly decodable from the internal embeddings. Together, these results expose the chemical logic inside SynFlowNet and provide actionable and mechanistic insight that supports transparent and controllable molecular design.
comment: 13 pages, 7 figures. Accepted for presentation at NeurIPS 2025 WiML Workshop and Molecular Machine Learning Conference (MoML) 2025
☆ Solar-GECO: Perovskite Solar Cell Property Prediction with Geometric-Aware Co-Attention NeurIPS 2025
Perovskite solar cells are promising candidates for next-generation photovoltaics. However, their performance as multi-scale devices is determined by complex interactions between their constituent layers. This creates a vast combinatorial space of possible materials and device architectures, making the conventional experimental-based screening process slow and expensive. Machine learning models try to address this problem, but they only focus on individual material properties or neglect the important geometric information of the perovskite crystal. To address this problem, we propose to predict perovskite solar cell power conversion efficiency with a geometric-aware co-attention (Solar-GECO) model. Solar-GECO combines a geometric graph neural network (GNN) - that directly encodes the atomic structure of the perovskite absorber - with language model embeddings that process the textual strings representing the chemical compounds of the transport layers and other device components. Solar-GECO also integrates a co-attention module to capture intra-layer dependencies and inter-layer interactions, while a probabilistic regression head predicts both power conversion efficiency (PCE) and its associated uncertainty. Solar-GECO achieves state-of-the-art performance, significantly outperforming several baselines, reducing the mean absolute error (MAE) for PCE prediction from 3.066 to 2.936 compared to semantic GNN (the previous state-of-the-art model). Solar-GECO demonstrates that integrating geometric and textual information provides a more powerful and accurate framework for PCE prediction.
comment: Accepted at the AI for Accelerated Materials Design (AI4Mat) Workshop at NeurIPS 2025. 14 pages, 4 figures
☆ Psychometric Tests for AI Agents and Their Moduli Space
We develop a moduli-theoretic view of psychometric test batteries for AI agents and connect it explicitly to the AAI score developed previously. First, we make precise the notion of an AAI functional on a battery and set out axioms that any reasonable autonomy/general intelligence score should satisfy. Second, we show that the composite index ('AAI-Index') defined previously is a special case of our AAI functional. Third, we introduce the notion of a cognitive core of an agent relative to a battery and define the associated AAI$_{\textrm{core}}$ score as the restriction of an AAI functional to that core. Finally, we use these notions to describe invariants of batteries under evaluation-preserving symmetries and outline how moduli of equivalent batteries are organized.
☆ A Nutrition Multimodal Photoplethysmography Language Model
Hunger and satiety dynamics shape dietary behaviors and metabolic health, yet remain difficult to capture in everyday settings. We present a Nutrition Photoplethysmography Language Model (NPLM), integrating continuous photoplethysmography (PPG) from wearables with meal descriptions. NPLM projects PPG into embeddings interpretable by language models, enabling joint reasoning over physiology and meal context. Trained on 19,340 participants and 1.1 million meal-PPG pairs, the model improved daily caloric intake prediction by 11% over text-only baselines, with accuracy maintained when 80% of meal text was removed. In an independent validation study (n=140) with controlled dining and detailed meal information, the model replicated these findings. These results demonstrate the value of integrating physiological measurements from consumer wearables with meal information for noninvasive dietary monitoring at scale.
comment: 21 pages, 2 figures
☆ Medusa: Cross-Modal Transferable Adversarial Attacks on Multimodal Medical Retrieval-Augmented Generation KDD 2026
With the rapid advancement of retrieval-augmented vision-language models, multimodal medical retrieval-augmented generation (MMed-RAG) systems are increasingly adopted in clinical decision support. These systems enhance medical applications by performing cross-modal retrieval to integrate relevant visual and textual evidence for tasks, e.g., report generation and disease diagnosis. However, their complex architecture also introduces underexplored adversarial vulnerabilities, particularly via visual input perturbations. In this paper, we propose Medusa, a novel framework for crafting cross-modal transferable adversarial attacks on MMed-RAG systems under a black-box setting. Specifically, Medusa formulates the attack as a perturbation optimization problem, leveraging a multi-positive InfoNCE loss (MPIL) to align adversarial visual embeddings with medically plausible but malicious textual targets, thereby hijacking the retrieval process. To enhance transferability, we adopt a surrogate model ensemble and design a dual-loop optimization strategy augmented with invariant risk minimization (IRM). Extensive experiments on two real-world medical tasks, including medical report generation and disease diagnosis, demonstrate that Medusa achieves over 90% average attack success rate across various generation models and retrievers under appropriate parameter configuration, while remaining robust against four mainstream defenses, outperforming state-of-the-art baselines. Our results reveal critical vulnerabilities in the MMed-RAG systems and highlight the necessity of robustness benchmarking in safety-critical medical applications. The code and data are available at https://anonymous.4open.science/r/MMed-RAG-Attack-F05A.
comment: Accepted at KDD 2026 First Cycle (full version). Authors marked with * contributed equally. Yi Liu is the lead author
☆ SimDiff: Simpler Yet Better Diffusion Model for Time Series Point Forecasting AAAI 2026
Diffusion models have recently shown promise in time series forecasting, particularly for probabilistic predictions. However, they often fail to achieve state-of-the-art point estimation performance compared to regression-based methods. This limitation stems from difficulties in providing sufficient contextual bias to track distribution shifts and in balancing output diversity with the stability and precision required for point forecasts. Existing diffusion-based approaches mainly focus on full-distribution modeling under probabilistic frameworks, often with likelihood maximization objectives, while paying little attention to dedicated strategies for high-accuracy point estimation. Moreover, other existing point prediction diffusion methods frequently rely on pre-trained or jointly trained mature models for contextual bias, sacrificing the generative flexibility of diffusion models. To address these challenges, we propose SimDiff, a single-stage, end-to-end framework. SimDiff employs a single unified Transformer network carefully tailored to serve as both denoiser and predictor, eliminating the need for external pre-trained or jointly trained regressors. It achieves state-of-the-art point estimation performance by leveraging intrinsic output diversity and improving mean squared error accuracy through multiple inference ensembling. Key innovations, including normalization independence and the median-of-means estimator, further enhance adaptability and stability. Extensive experiments demonstrate that SimDiff significantly outperforms existing methods in time series point forecasting.
comment: Accepted by AAAI 2026
☆ Adversarial Patch Attacks on Vision-Based Cargo Occupancy Estimation via Differentiable 3D Simulation
Computer vision systems are increasingly adopted in modern logistics operations, including the estimation of trailer occupancy for planning, routing, and billing. Although effective, such systems may be vulnerable to physical adversarial attacks, particularly adversarial patches that can be printed and placed on interior surfaces. In this work, we study the feasibility of such attacks on a convolutional cargo-occupancy classifier using fully simulated 3D environments. Using Mitsuba 3 for differentiable rendering, we optimize patch textures across variations in geometry, lighting, and viewpoint, and compare their effectiveness to a 2D compositing baseline. Our experiments demonstrate that 3D-optimized patches achieve high attack success rates, especially in a denial-of-service scenario (empty to full), where success reaches 84.94 percent. Concealment attacks (full to empty) prove more challenging but still reach 30.32 percent. We analyze the factors influencing attack success, discuss implications for the security of automated logistics pipelines, and highlight directions for strengthening physical robustness. To our knowledge, this is the first study to investigate adversarial patch attacks for cargo-occupancy estimation in physically realistic, fully simulated 3D scenes.
comment: 9 pages, 5 figures, 1 algorithm
☆ MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization
Cooperative Multi-Agent Reinforcement Learning (MARL) faces two major design bottlenecks: crafting dense reward functions and constructing curricula that avoid local optima in high-dimensional, non-stationary environments. Existing approaches rely on fixed heuristics or use Large Language Models (LLMs) directly in the control loop, which is costly and unsuitable for real-time systems. We propose MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization), a framework that moves the LLM outside the execution loop and uses it as an offline training architect. MAESTRO introduces two generative components: (i) a semantic curriculum generator that creates diverse, performance-driven traffic scenarios, and (ii) an automated reward synthesizer that produces executable Python reward functions adapted to evolving curriculum difficulty. These components guide a standard MARL backbone (MADDPG) without increasing inference cost at deployment. We evaluate MAESTRO on large-scale traffic signal control (Hangzhou, 16 intersections) and conduct controlled ablations. Results show that combining LLM-generated curricula with LLM-generated reward shaping yields improved performance and stability. Across four seeds, the full system achieves +4.0% higher mean return (163.26 vs. 156.93) and 2.2% better risk-adjusted performance (Sharpe 1.53 vs. 0.70) over a strong curriculum baseline. These findings highlight LLMs as effective high-level designers for cooperative MARL training.
comment: Preprint. 16 pages, 6 figures. Preliminary version; extended experiments and analysis forthcoming
☆ Neural Architecture Search for Quantum Autoencoders
In recent years, machine learning and deep learning have driven advances in domains such as image classification, speech recognition, and anomaly detection by leveraging multi-layer neural networks to model complex data. Simultaneously, quantum computing (QC) promises to address classically intractable problems via quantum parallelism, motivating research in quantum machine learning (QML). Among QML techniques, quantum autoencoders show promise for compressing high-dimensional quantum and classical data. However, designing effective quantum circuit architectures for quantum autoencoders remains challenging due to the complexity of selecting gates, arranging circuit layers, and tuning parameters. This paper proposes a neural architecture search (NAS) framework that automates the design of quantum autoencoders using a genetic algorithm (GA). By systematically evolving variational quantum circuit (VQC) configurations, our method seeks to identify high-performing hybrid quantum-classical autoencoders for data reconstruction without becoming trapped in local minima. We demonstrate effectiveness on image datasets, highlighting the potential of quantum autoencoders for efficient feature extraction within a noise-prone, near-term quantum era. Our approach lays a foundation for broader application of genetic algorithms to quantum architecture search, aiming for a robust, automated method that can adapt to varied data and hardware constraints.
☆ Local Entropy Search over Descent Sequences for Bayesian Optimization
Searching large and complex design spaces for a global optimum can be infeasible and unnecessary. A practical alternative is to iteratively refine the neighborhood of an initial design using local optimization methods such as gradient descent. We propose local entropy search (LES), a Bayesian optimization paradigm that explicitly targets the solutions reachable by the descent sequences of iterative optimizers. The algorithm propagates the posterior belief over the objective through the optimizer, resulting in a probability distribution over descent sequences. It then selects the next evaluation by maximizing mutual information with that distribution, using a combination of analytic entropy calculations and Monte-Carlo sampling of descent sequences. Empirical results on high-complexity synthetic objectives and benchmark problems show that LES achieves strong sample efficiency compared to existing local and global Bayesian optimization methods.
☆ SENTINEL: A Fully End-to-End Language-Action Model for Humanoid Whole Body Control
Existing humanoid control systems often rely on teleoperation or modular generation pipelines that separate language understanding from physical execution. However, the former is entirely human-driven, and the latter lacks tight alignment between language commands and physical behaviors. In this paper, we present SENTINEL, a fully end-to-end language-action model for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation using a pretrained whole body controller, combined with their text annotations. The model directly maps language commands and proprioceptive inputs to low-level actions without any intermediate representation. The model generates action chunks using flow matching, which can be subsequently refined by a residual action head for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and real-world deployment, and also supports multi-modal extensions by converting inputs into texts.
comment: 23 pages, 8 figures, 11 tables
☆ In Machina N400: Pinpointing Where a Causal Language Model Detects Semantic Violations
How and where does a transformer notice that a sentence has gone semantically off the rails? To explore this question, we evaluated the causal language model (phi-2) using a carefully curated corpus, with sentences that concluded plausibly or implausibly. Our analysis focused on the hidden states sampled at each model layer. To investigate how violations are encoded, we utilized two complementary probes. First, we conducted a per-layer detection using a linear probe. Our findings revealed that a simple linear decoder struggled to distinguish between plausible and implausible endings in the lowest third of the model's layers. However, its accuracy sharply increased in the middle blocks, reaching a peak just before the top layers. Second, we examined the effective dimensionality of the encoded violation. Initially, the violation widens the representational subspace, followed by a collapse after a mid-stack bottleneck. This might indicate an exploratory phase that transitions into rapid consolidation. Taken together, these results contemplate the idea of alignment with classical psycholinguistic findings in human reading, where semantic anomalies are detected only after syntactic resolution, occurring later in the online processing sequence.
comment: Accepted at AICS2025
☆ Learning Plug-and-play Memory for Guiding Video Diffusion Models
Diffusion Transformer(DiT) based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore how to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies to show that DiT can be steered via interventions on its hidden states, and simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance and high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose a learnable memory encoder DiT-Mem, composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a compact set of memory tokens, which are concatenated as the memory within the DiT self-attention layers. During training, we keep the diffusion backbone frozen, and only optimize the memory encoder. It yields a rather efficient training process on few training parameters (150M) and 10K data samples, and enables plug-and-play usage at inference time. Extensive experiments on state-of-the-art models demonstrate the effectiveness of our method in improving physical rule following and video fidelity. Our code and data are publicly released here: https://thrcle421.github.io/DiT-Mem-Web/.
☆ Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering
Large vision language models (VLMs) have achieved impressive performance on medical visual question answering benchmarks, yet their reliance on visual information remains unclear. We investigate whether frontier VLMs demonstrate genuine visual grounding when answering Italian medical questions by testing four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 flash exp. Using 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation, we substitute correct medical images with blank placeholders to test whether models truly integrate visual and textual information. Our results reveal striking variability in visual dependency: GPT-4o shows the strongest visual grounding with a 27.9pp accuracy drop (83.2% [74.6%, 91.7%] to 55.3% [44.1%, 66.6%]), while GPT-5-mini, Gemini, and Claude maintain high accuracy with modest drops of 8.5pp, 2.4pp, and 5.6pp respectively. Analysis of model-generated reasoning reveals confident explanations for fabricated visual interpretations across all models, suggesting varying degrees of reliance on textual shortcuts versus genuine visual analysis. These findings highlight critical differences in model robustness and the need for rigorous evaluation before clinical deployment.
comment: Accepted at the Workshop on Multimodal Representation Learning for Healthcare (MMRL4H), EurIPS 2025
☆ Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization
Large Language Models (LLMs) have developed rapidly in web services, delivering unprecedented capabilities while amplifying societal risks. Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real-world web contexts. To mitigate these challenges, we propose ACE-Safety (Adversarial Co-Evolution for LLM Safety), a novel framework that jointly optimize attack and defense models by seamlessly integrating two key innovative procedures: (1) Group-aware Strategy-guided Monte Carlo Tree Search (GS-MCTS), which efficiently explores jailbreak strategies to uncover vulnerabilities and generate diverse adversarial samples; (2) Adversarial Curriculum Tree-aware Group Policy Optimization (AC-TGPO), which jointly trains attack and defense LLMs with challenging samples via curriculum reinforcement learning, enabling robust mutual improvement. Evaluations across multiple benchmarks demonstrate that our method outperforms existing attack and defense approaches, and provides a feasible pathway for developing LLMs that can sustainably support responsible AI ecosystems.
☆ CLASH: A Benchmark for Cross-Modal Contradiction Detection
Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection - a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection, featuring COCO images paired with contradictory captions containing controlled object-level or attribute-level contradictions. The samples include targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.
comment: First two authors contributed equally
☆ Torsion-Space Diffusion for Protein Backbone Generation with Geometric Refinement
Designing new protein structures is fundamental to computational biology, enabling advances in therapeutic molecule discovery and enzyme engineering. Existing diffusion-based generative models typically operate in Cartesian coordinate space, where adding noise disrupts strict geometric constraints such as fixed bond lengths and angles, often producing physically invalid structures. To address this limitation, we propose a Torsion-Space Diffusion Model that generates protein backbones by denoising torsion angles, ensuring perfect local geometry by construction. A differentiable forward-kinematics module reconstructs 3D coordinates with fixed 3.8 Angstrom backbone bond lengths while a constrained post-processing refinement optimizes global compactness via Radius of Gyration (Rg) correction, without violating bond constraints. Experiments on standard PDB proteins demonstrate 100% bond-length accuracy and significantly improved structural compactness, reducing Rg error from 70% to 18.6% compared to Cartesian diffusion baselines. Overall, this hybrid torsion-diffusion plus geometric-refinement framework generates physically valid and compact protein backbones, providing a promising path toward full-atom protein generation.
comment: 5 pages, 4 figures
☆ LLM-Based Agentic Negotiation for 6G: Addressing Uncertainty Neglect and Tail-Event Risk
A critical barrier to the trustworthiness of sixth-generation (6G) agentic autonomous networks is the uncertainty neglect bias; a cognitive tendency for large language model (LLM)-powered agents to make high-stakes decisions based on simple averages while ignoring the tail risk of extreme events. This paper proposes an unbiased, risk-aware framework for agentic negotiation, designed to ensure robust resource allocation in 6G network slicing. Specifically, agents leverage Digital Twins (DTs) to predict full latency distributions, which are then evaluated using a formal framework from extreme value theory, namely, Conditional Value-at-Risk (CVaR). This approach fundamentally shifts the agent's objective from reasoning over the mean to reasoning over the tail, thereby building a statistically-grounded buffer against worst-case outcomes. Furthermore, our framework ensures full uncertainty awareness by requiring agents to quantify epistemic uncertainty -- confidence in their own DTs predictions -- and propagate this meta-verification to make robust decisions, preventing them from acting on unreliable data. We validate this framework in a 6G inter-slice negotiation use-case between an eMBB and a URLLC agent. The results demonstrate the profound failure of the biased, mean-based baseline, which consistently fails its SLAs with a 25\% rate. Our unbiased, CVaR-aware agent successfully mitigates this bias, eliminating SLA violations and reducing the URLLC and eMBB p99.999 latencies by around 11\%. We show this reliability comes at the rational and quantifiable cost of slightly reduced energy savings to 17\%, exposing the false economy of the biased approach. This work provides a concrete methodology for building the trustworthy autonomous systems required for 6G.
comment: Link to open-source non-commercial code available
☆ Information Physics of Intelligence: Unifying Logical Depth and Entropy under Thermodynamic Constraints
The rapid scaling of artificial intelligence models has revealed a fundamental tension between model capacity (storage) and inference efficiency (computation). While classical information theory focuses on transmission and storage limits, it lacks a unified physical framework to quantify the thermodynamic costs of generating information from compressed laws versus retrieving it from memory. In this paper, we propose a theoretical framework that treats information processing as an enabling mapping from ontological states to carrier states. We introduce a novel metric, Derivation Entropy, which quantifies the effective work required to compute a target state from a given logical depth. By analyzing the interplay between Shannon entropy (storage) and computational complexity (time/energy), we demonstrate the existence of a critical phase transition point. Below this threshold, memory retrieval is thermodynamically favorable; above it, generative computation becomes the optimal strategy. This "Energy-Time-Space" conservation law provides a physical explanation for the efficiency of generative models and offers a rigorous mathematical bound for designing next-generation, energy-efficient AI architectures. Our findings suggest that the minimization of Derivation Entropy is a governing principle for the evolution of both biological and artificial intelligence.
☆ EEG-VLM: A Hierarchical Vision-Language Model with Multi-Level Feature Alignment and Visually Enhanced Language-Guided Reasoning for EEG Image-Based Sleep Stage Prediction
Sleep stage classification based on electroencephalography (EEG) is fundamental for assessing sleep quality and diagnosing sleep-related disorders. However, most traditional machine learning methods rely heavily on prior knowledge and handcrafted features, while existing deep learning models still struggle to jointly capture fine-grained time-frequency patterns and achieve clinical interpretability. Recently, vision-language models (VLMs) have made significant progress in the medical domain, yet their performance remains constrained when applied to physiological waveform data, especially EEG signals, due to their limited visual understanding and insufficient reasoning capability. To address these challenges, we propose EEG-VLM, a hierarchical vision-language framework that integrates multi-level feature alignment with visually enhanced language-guided reasoning for interpretable EEG-based sleep stage classification. Specifically, a specialized visual enhancement module constructs high-level visual tokens from intermediate-layer features to extract rich semantic representations of EEG images. These tokens are further aligned with low-level CLIP features through a multi-level alignment mechanism, enhancing the VLM's image-processing capability. In addition, a Chain-of-Thought (CoT) reasoning strategy decomposes complex medical inference into interpretable logical steps, effectively simulating expert-like decision-making. Experimental results demonstrate that the proposed method significantly improves both the accuracy and interpretability of VLMs in EEG-based sleep stage classification, showing promising potential for automated and explainable EEG analysis in clinical settings.
☆ From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation
This paper introduces the retrieval-augmented framework for automatic fashion caption and hashtag generation, combining multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically interesting text for fashion imagery, overcoming the limitations of end-to-end captioners that have problems with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module for fabric and gender attribute inference based on a structured product index. These attributes, together with retrieved style examples, create a factual evidence pack that is used to guide an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model is used as a supervised baseline model for comparison. Experimental results show that the YOLO detector is able to obtain a mean Average Precision (mAP@0.5) of 0.71 for nine categories of garments. The RAG-LLM pipeline generates expressive attribute-aligned captions and achieves mean attribute coverage of 0.80 with full coverage at the 50% threshold in hashtag generation, whereas BLIP gives higher lexical overlap and lower generalization. The retrieval-augmented approach exhibits better factual grounding, less hallucination, and great potential for scalable deployment in various clothing domains. These results demonstrate the use of retrieval-augmented generation as an effective and interpretable paradigm for automated and visually grounded fashion content generation.
comment: Submitted to Expert Systems with Applications
☆ Uncertainty-Aware Deep Learning Framework for Remaining Useful Life Prediction in Turbofan Engines with Learned Aleatoric Uncertainty
Accurate Remaining Useful Life (RUL) prediction coupled with uncertainty quantification remains a critical challenge in aerospace prognostics. This research introduces a novel uncertainty-aware deep learning framework that learns aleatoric uncertainty directly through probabilistic modeling, an approach unexplored in existing CMAPSS-based literature. Our hierarchical architecture integrates multi-scale Inception blocks for temporal pattern extraction, bidirectional Long Short-Term Memory networks for sequential modeling, and a dual-level attention mechanism operating simultaneously on sensor and temporal dimensions. The innovation lies in the Bayesian output layer that predicts both mean RUL and variance, enabling the model to learn data-inherent uncertainty. Comprehensive preprocessing employs condition-aware clustering, wavelet denoising, and intelligent feature selection. Experimental validation on NASA CMAPSS benchmarks (FD001-FD004) demonstrates competitive overall performance with RMSE values of 16.22, 19.29, 16.84, and 19.98 respectively. Remarkably, our framework achieves breakthrough critical zone performance (RUL <= 30 cycles) with RMSE of 5.14, 6.89, 5.27, and 7.16, representing 25-40 percent improvements over conventional approaches and establishing new benchmarks for safety-critical predictions. The learned uncertainty provides well-calibrated 95 percent confidence intervals with coverage ranging from 93.5 percent to 95.2 percent, enabling risk-aware maintenance scheduling previously unattainable in CMAPSS literature.
comment: 10 pages, 2 figures, 3 tables. Submitted to arXiv
☆ On the Optimality of Discrete Object Naming: a Kinship Case Study
The structure of naming systems in natural languages hinges on a trade-off between high informativeness and low complexity. Prior work capitalizes on information theory to formalize these notions; however, these studies generally rely on two simplifications: (i) optimal listeners, and (ii) universal communicative need across languages. Here, we address these limitations by introducing an information-theoretic framework for discrete object naming systems, and we use it to prove that an optimal trade-off is achievable if and only if the listener's decoder is equivalent to the Bayesian decoder of the speaker. Adopting a referential game setup from emergent communication, and focusing on the semantic domain of kinship, we show that our notion of optimality is not only theoretically achievable but also emerges empirically in learned communication systems.
☆ AI Consciousness and Existential Risk
In AI, the existential risk denotes the hypothetical threat posed by an artificial system that would possess both the capability and the objective, either directly or indirectly, to eradicate humanity. This issue is gaining prominence in scientific debate due to recent technical advancements and increased media coverage. In parallel, AI progress has sparked speculation and studies about the potential emergence of artificial consciousness. The two questions, AI consciousness and existential risk, are sometimes conflated, as if the former entailed the latter. Here, I explain that this view stems from a common confusion between consciousness and intelligence. Yet these two properties are empirically and theoretically distinct. Arguably, while intelligence is a direct predictor of an AI system's existential threat, consciousness is not. There are, however, certain incidental scenarios in which consciousness could influence existential risk, in either direction. Consciousness could be viewed as a means towards AI alignment, thereby lowering existential risk; or, it could be a precondition for reaching certain capabilities or levels of intelligence, and thus positively related to existential risk. Recognizing these distinctions can help AI safety researchers and public policymakers focus on the most pressing issues.
☆ Physics-informed Neural Operator Learning for Nonlinear Grad-Shafranov Equation
As artificial intelligence emerges as a transformative enabler for fusion energy commercialization, fast and accurate solvers become increasingly critical. In magnetic confinement nuclear fusion, rapid and accurate solution of the Grad-Shafranov equation (GSE) is essential for real-time plasma control and analysis. Traditional numerical solvers achieve high precision but are computationally prohibitive, while data-driven surrogates infer quickly but fail to enforce physical laws and generalize poorly beyond training distributions. To address this challenge, we present a Physics-Informed Neural Operator (PINO) that directly learns the GSE solution operator, mapping shape parameters of last closed flux surface to equilibrium solutions for realistic nonlinear current profiles. Comprehensive benchmarking of five neural architectures identifies the novel Transformer-KAN (Kolmogorov-Arnold Network) Neural Operator (TKNO) as achieving highest accuracy (0.25% mean L2 relative error) under supervised training (only data-driven). However, all data-driven models exhibit large physics residuals, indicating poor physical consistency. Our unsupervised training can reduce the residuals by nearly four orders of magnitude through embedding physics-based loss terms without labeled data. Critically, semi-supervised learning--integrating sparse labeled data (100 interior points) with physics constraints--achieves optimal balance: 0.48% interpolation error and the most robust extrapolation performance (4.76% error, 8.9x degradation factor vs 39.8x for supervised models). Accelerated by TensorRT optimization, our models enable millisecond-level inference, establishing PINO as a promising pathway for next-generation fusion control systems.
comment: 42 pages, 17 figures, 8 tables,
☆ The Core in Max-Loss Non-Centroid Clustering Can Be Empty
We study core stability in non-centroid clustering under the max-loss objective, where each agent's loss is the maximum distance to other members of their cluster. We prove that for all $k\geq 3$ there exist metric instances with $n\ge 9$ agents, with $n$ divisible by $k$, for which no clustering lies in the $α$-core for any $α<2^{\frac{1}{5}}\sim 1.148$. The bound is tight for our construction. Using a computer-aided proof, we also identify a two-dimensional Euclidean point set whose associated lower bound is slightly smaller than that of our general construction. This is, to our knowledge, the first impossibility result showing that the core can be empty in non-centroid clustering under the max-loss objective.
☆ Extracting Robust Register Automata from Neural Networks over Data Sequences
Automata extraction is a method for synthesising interpretable surrogates for black-box neural models that can be analysed symbolically. Existing techniques assume a finite input alphabet, and thus are not directly applicable to data sequences drawn from continuous domains. We address this challenge with deterministic register automata (DRAs), which extend finite automata with registers that store and compare numeric values. Our main contribution is a framework for robust DRA extraction from black-box models: we develop a polynomial-time robustness checker for DRAs with a fixed number of registers, and combine it with passive and active automata learning algorithms. This combination yields surrogate DRAs with statistical robustness and equivalence guarantees. As a key application, we use the extracted automata to assess the robustness of neural networks: for a given sequence and distance metric, the DRA either certifies local robustness or produces a concrete counterexample. Experiments on recurrent neural networks and transformer architectures show that our framework reliably learns accurate automata and enables principled robustness evaluation. Overall, our results demonstrate that robust DRA extraction effectively bridges neural network interpretability and formal reasoning without requiring white-box access to the underlying network.
☆ EnfoPath: Energy-Informed Analysis of Generative Trajectories in Flow Matching
Flow-based generative models synthesize data by integrating a learned velocity field from a reference distribution to the target data distribution. Prior work has focused on endpoint metrics (e.g., fidelity, likelihood, perceptual quality) while overlooking a deeper question: what do the sampling trajectories reveal? Motivated by classical mechanics, we introduce kinetic path energy (KPE), a simple yet powerful diagnostic that quantifies the total kinetic effort along each generation path of ODE-based samplers. Through comprehensive experiments on CIFAR-10 and ImageNet-256, we uncover two key phenomena: ({i}) higher KPE predicts stronger semantic quality, indicating that semantically richer samples require greater kinetic effort, and ({ii}) higher KPE inversely correlates with data density, with informative samples residing in sparse, low-density regions. Together, these findings reveal that semantically informative samples naturally reside on the sparse frontier of the data distribution, demanding greater generative effort. Our results suggest that trajectory-level analysis offers a physics-inspired and interpretable framework for understanding generation difficulty and sample characteristics.
comment: EurIPS 2025 Workshop on Principles of Generative Modeling (PriGM)
☆ GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, including multi-step reasoning such as mathematical proving. However, existing approaches often lack an explicit and dynamic mechanism to structurally represent and evolve intermediate reasoning states, which limits their ability to perform context-aware theorem selection and iterative conclusion generation. To address these challenges, we propose GraphMind, a novel dynamic graph-based framework that integrates the graph neural network (GNN) with LLMs to iteratively select theorems and generate intermediate conclusions for multi-step reasoning. Our method models the reasoning process as a heterogeneous evolving graph, where nodes represent conditions, theorems, and conclusions, while edges capture logical dependencies between nodes. By encoding the current reasoning state with GNN and leveraging semantic matching for theorem selection, our framework enables context-aware, interpretable, and structured reasoning in a closed-loop manner. Experiments on various question-answering (QA) datasets demonstrate that our proposed GraphMind method achieves consistent performance improvements and significantly outperforms existing baselines in multi-step reasoning, validating the effectiveness and generalizability of our approach.
☆ DynaMix: Generalizable Person Re-identification via Dynamic Relabeling and Mixed Data Sampling
Generalizable person re-identification (Re-ID) aims to recognize individuals across unseen cameras and environments. While existing methods rely heavily on limited labeled multi-camera data, we propose DynaMix, a novel method that effectively combines manually labeled multi-camera and large-scale pseudo-labeled single-camera data. Unlike prior works, DynaMix dynamically adapts to the structure and noise of the training data through three core components: (1) a Relabeling Module that refines pseudo-labels of single-camera identities on-the-fly; (2) an Efficient Centroids Module that maintains robust identity representations under a large identity space; and (3) a Data Sampling Module that carefully composes mixed data mini-batches to balance learning complexity and intra-batch diversity. All components are specifically designed to operate efficiently at scale, enabling effective training on millions of images and hundreds of thousands of identities. Extensive experiments demonstrate that DynaMix consistently outperforms state-of-the-art methods in generalizable person Re-ID.
☆ Mitigating Participation Imbalance Bias in Asynchronous Federated Learning
In Asynchronous Federated Learning (AFL), the central server immediately updates the global model with each arriving client's contribution. As a result, clients perform their local training on different model versions, causing information staleness (delay). In federated environments with non-IID local data distributions, this asynchronous pattern amplifies the adverse effect of client heterogeneity (due to different data distribution, local objectives, etc.), as faster clients contribute more frequent updates, biasing the global model. We term this phenomenon heterogeneity amplification. Our work provides a theoretical analysis that maps AFL design choices to their resulting error sources when heterogeneity amplification occurs. Guided by our analysis, we propose ACE (All-Client Engagement AFL), which mitigates participation imbalance through immediate, non-buffered updates that use the latest information available from all clients. We also introduce a delay-aware variant, ACED, to balance client diversity against update staleness. Experiments on different models for different tasks across diverse heterogeneity and delay settings validate our analysis and demonstrate the robust performance of our approaches.
☆ Understanding, Accelerating, and Improving MeanFlow Training
MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
☆ Large Language Model-Assisted Planning of Electric Vehicle Charging Infrastructure with Real-World Case Study
The growing demand for electric vehicle (EV) charging infrastructure presents significant planning challenges, requiring efficient strategies for investment and operation to deliver cost-effective charging services. However, the potential benefits of EV charging assignment, particularly in response to varying spatial-temporal patterns of charging demand, remain under-explored in infrastructure planning. This paper proposes an integrated approach that jointly optimizes investment decisions and charging assignments while accounting for spatial-temporal demand dynamics and their interdependencies. To support efficient model development, we leverage a large language model (LLM) to assist in generating and refining the mathematical formulation from structured natural-language descriptions, significantly reducing the modeling burden. The resulting optimization model enables optimal joint decision-making for investment and operation. Additionally, we propose a distributed optimization algorithm based on the Alternating Direction Method of Multipliers (ADMM) to address computational complexity in high-dimensional scenarios, which can be executed on standard computing platforms. We validate our approach through a case study using 1.5 million real-world travel records from Chengdu, China, demonstrating a 30% reduction in total cost compared to a baseline without EV assignment.
☆ MedSAM3: Delving into Segment Anything with Medical Concepts
Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical application. Here, we propose MedSAM-3, a text promptable medical segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, Ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.
☆ CSD: Change Semantic Detection with only Semantic Change Masks for Damage Assessment in Conflict Zones
Accurately and swiftly assessing damage from conflicts is crucial for humanitarian aid and regional stability. In conflict zones, damaged zones often share similar architectural styles, with damage typically covering small areas and exhibiting blurred boundaries. These characteristics lead to limited data, annotation difficulties, and significant recognition challenges, including high intra-class similarity and ambiguous semantic changes. To address these issues, we introduce a pre-trained DINOv3 model and propose a multi-scale cross-attention difference siamese network (MC-DiSNet). The powerful visual representation capability of the DINOv3 backbone enables robust and rich feature extraction from bi-temporal remote sensing images. We also release a new Gaza-change dataset containing high-resolution satellite image pairs from 2023-2024 with pixel-level semantic change annotations. It is worth emphasizing that our annotations only include semantic pixels of changed areas. Unlike conventional semantic change detection (SCD), our approach eliminates the need for large-scale semantic annotations of bi-temporal images, instead focusing directly on the changed regions. We term this new task change semantic detection (CSD). The CSD task represents a direct extension of binary change detection (BCD). Due to the limited spatial extent of semantic regions, it presents greater challenges than traditional SCD tasks. We evaluated our method under the CSD framework on both the Gaza-Change and SECOND datasets. Experimental results demonstrate that our proposed approach effectively addresses the CSD task, and its outstanding performance paves the way for practical applications in rapid damage assessment across conflict zones.
☆ Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling
Blind image quality assessment (BIQA) plays a crucial role in evaluating and optimizing visual experience. Most existing BIQA approaches fuse shallow and deep features extracted from backbone networks, while overlooking the unequal contributions to quality prediction. Moreover, while various vision encoder backbones are widely adopted in BIQA, the effective quality decoding architectures remain underexplored. To address these limitations, this paper investigates the contributions of shallow and deep features to BIQA, and proposes a effective quality feature decoding framework via GCN-enhanced \underline{l}ayer\underline{i}nteraction and MoE-based \underline{f}eature d\underline{e}coupling, termed \textbf{(Life-IQA)}. Specifically, the GCN-enhanced layer interaction module utilizes the GCN-enhanced deepest-layer features as query and the penultimate-layer features as key, value, then performs cross-attention to achieve feature interaction. Moreover, a MoE-based feature decoupling module is proposed to decouple fused representations though different experts specialized for specific distortion types or quality dimensions. Extensive experiments demonstrate that Life-IQA shows more favorable balance between accuracy and cost than a vanilla Transformer decoder and achieves state-of-the-art performance on multiple BIQA benchmarks.The code is available at: \href{https://github.com/TANGLONG2/Life-IQA/tree/main}{\texttt{Life-IQA}}.
☆ OrdMoE: Preference Alignment via Hierarchical Expert Group Ranking in Multimodal Mixture-of-Experts LLMs
Preference learning has recently emerged as a pivotal strategy for post-training alignment of Multimodal Large Language Models (MLLMs). However, existing approaches predominantly rely on external human-annotated preference data, which is costly and labor-intensive to collect. In this work, we propose OrdMoE, a novel preference alignment framework that bypasses the reliance on external human preferences entirely by leveraging intrinsic signals within Mixture-of-Experts (MoE) architectures. Specifically, we observe that the router's expert selection scores implicitly encode a quality-aware ranking of responses (i.e. higher-scoring experts consistently generate higher-quality outputs). Building on this insight, OrdMoE constructs an internal preference hierarchy by grouping experts into ranked tiers based on their per-token routing scores and activating each tier separately to produce a sequence of responses with increasing quality. This yields a zero-cost, self-supervised preference ordering over generated responses, which can be directly optimized using standard preference learning objectives. Extensive experiments across multiple multimodal benchmarks demnstrate that OrdMoE significantly enhances both alignment and overall performance of multimodal Mixture-of-Experts LLMs, achieving competitive results without requiring any human-annotated preference data.
☆ Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding
Spoken Language Understanding (SLU) consists of two sub-tasks: intent detection (ID) and slot filling (SF). Given its broad range of real-world applications, enhancing SLU for practical deployment is increasingly critical. Profile-based SLU addresses ambiguous user utterances by incorporating context awareness (CA), user profiles (UP), and knowledge graphs (KG) to support disambiguation, thereby advancing SLU research toward real-world applicability. However, existing SLU datasets still fall short in representing real-world scenarios. Specifically, (1) CA uses one-hot vectors for representation, which is overly idealized, and (2) models typically focuses solely on predicting intents and slot labels, neglecting the reasoning process that could enhance performance and interpretability. To overcome these limitations, we introduce VRSLU, a novel SLU dataset that integrates both Visual images and explicit Reasoning. For over-idealized CA, we use GPT-4o and FLUX.1-dev to generate images reflecting users' environments and statuses, followed by human verification to ensure quality. For reasoning, GPT-4o is employed to generate explanations for predicted labels, which are then refined by human annotators to ensure accuracy and coherence. Additionally, we propose an instructional template, LR-Instruct, which first predicts labels and then generates corresponding reasoning. This two-step approach helps mitigate the influence of reasoning bias on label prediction. Experimental results confirm the effectiveness of incorporating visual information and highlight the promise of explicit reasoning in advancing SLU.
☆ Enhancing low energy reconstruction and classification in KM3NeT/ORCA with transformers
The current KM3NeT/ORCA neutrino telescope, still under construction, has not yet reached its full potential in neutrino reconstruction capability. When training any deep learning model, no explicit information about the physics or the detector is provided, thus they remain unknown to the model. This study leverages the strengths of transformers by incorporating attention masks inspired by the physics and detector design, making the model understand both the telescope design and the neutrino physics measured on it. The study also shows the efficacy of transformers on retaining valuable information between detectors when doing fine-tuning from one configurations to another.
☆ Classification EM-PCA for clustering and embedding
The mixture model is undoubtedly one of the greatest contributions to clustering. For continuous data, Gaussian models are often used and the Expectation-Maximization (EM) algorithm is particularly suitable for estimating parameters from which clustering is inferred. If these models are particularly popular in various domains including image clustering, they however suffer from the dimensionality and also from the slowness of convergence of the EM algorithm. However, the Classification EM (CEM) algorithm, a classifying version, offers a fast convergence solution while dimensionality reduction still remains a challenge. Thus we propose in this paper an algorithm combining simultaneously and non-sequentially the two tasks --Data embedding and Clustering-- relying on Principal Component Analysis (PCA) and CEM. We demonstrate the interest of such approach in terms of clustering and data embedding. We also establish different connections with other clustering approaches.
comment: Accepted at the IEEE conference on Big Data (Special Session on Machine Learning)
☆ Rethinking Plant Disease Diagnosis: Bridging the Academic-Practical Gap with Vision Transformers and Zero-Shot Learning
Recent advances in deep learning have enabled significant progress in plant disease classification using leaf images. Much of the existing research in this field has relied on the PlantVillage dataset, which consists of well-centered plant images captured against uniform, uncluttered backgrounds. Although models trained on this dataset achieve high accuracy, they often fail to generalize to real-world field images, such as those submitted by farmers to plant diagnostic systems. This has created a significant gap between published studies and practical application requirements, highlighting the necessity of investigating and addressing this issue. In this study, we investigate whether attention-based architectures and zero-shot learning approaches can bridge the gap between curated academic datasets and real-world agricultural conditions in plant disease classification. We evaluate three model categories: Convolutional Neural Networks (CNNs), Vision Transformers, and Contrastive Language-Image Pre-training (CLIP)-based zero-shot models. While CNNs exhibit limited robustness under domain shift, Vision Transformers demonstrate stronger generalization by capturing global contextual features. Most notably, CLIP models classify diseases directly from natural language descriptions without any task-specific training, offering strong adaptability and interpretability. These findings highlight the potential of zero-shot learning as a practical and scalable domain adaptation strategy for plant health diagnosis in diverse field environments.
☆ Dynamic Mixture of Experts Against Severe Distribution Shifts
The challenge of building neural networks that can continuously learn and adapt to evolving data streams is central to the fields of continual learning (CL) and reinforcement learning (RL). This lifelong learning problem is often framed in terms of the plasticity-stability dilemma, focusing on issues like loss of plasticity and catastrophic forgetting. Unlike neural networks, biological brains maintain plasticity through capacity growth, inspiring researchers to explore similar approaches in artificial networks, such as adding capacity dynamically. Prior solutions often lack parameter efficiency or depend on explicit task indices, but Mixture-of-Experts (MoE) architectures offer a promising alternative by specializing experts for distinct distributions. This paper aims to evaluate a DynamicMoE approach for continual and reinforcement learning environments and benchmark its effectiveness against existing network expansion methods.
☆ MOCLIP: A Foundation Model for Large-Scale Nanophotonic Inverse Design
Foundation models (FM) are transforming artificial intelligence by enabling generalizable, data-efficient solutions across different domains for a broad range of applications. However, the lack of large and diverse datasets limits the development of FM in nanophotonics. This work presents MOCLIP (Metasurface Optics Contrastive Learning Pretrained), a nanophotonic foundation model that integrates metasurface geometry and spectra within a shared latent space. MOCLIP employs contrastive learning to align geometry and spectral representations using an experimentally acquired dataset with a sample density comparable to ImageNet-1K. The study demonstrates MOCLIP inverse design capabilities for high-throughput zero-shot prediction at a rate of 0.2 million samples per second, enabling the design of a full 4-inch wafer populated with high-density metasurfaces in minutes. It also shows generative latent-space optimization reaching 97 percent accuracy. Finally, we introduce an optical information storage concept that uses MOCLIP to achieve a density of 0.1 Gbit per square millimeter at the resolution limit, exceeding commercial optical media by a factor of six. These results position MOCLIP as a scalable and versatile platform for next-generation photonic design and data-driven applications.
☆ FastForward Pruning: Efficient LLM Pruning via Single-Step Reinforcement Learning
Pruning is an effective method for compressing Large Language Models, but finding an optimal, non-uniform layer-wise sparsity allocation remains a key challenge. While heuristic methods are fast but yield suboptimal performance, more powerful search-based approaches like Reinforcement Learning are often hindered by prohibitive computational costs on large-scale models. To overcome this efficiency barrier, we propose FastForward Pruning. Its core is a decoupled, single-step RL framework that separates policy optimization from the complex budget satisfaction problem. Such a decoupling is crucial for efficiently searching the vast policy space of LLMs. This curriculum-based strategy begins with low-cost, simple tasks and gradually increases in complexity, significantly reducing the search's computational overhead. Evaluated on the LLaMA, Mistral, and OPT model families, our framework discovers pruning policies that achieve superior performance over strong heuristic baselines. Crucially, when compared to other search-based algorithms, our method achieves competitive or superior results at a fraction of the computational cost, demonstrating a clear advantage in search efficiency.
comment: 5 pages, 2 figures, 4 tables
☆ LLM-CSEC: Empirical Evaluation of Security in C/C++ Code Generated by Large Language Models
The security of code generated by large language models (LLMs) is a significant concern, as studies indicate that such code often contains vulnerabilities and lacks essential defensive programming constructs. This work focuses on examining and evaluating the security of LLM-generated code, particularly in the context of C/C++. We categorized known vulnerabilities using the Common Weakness Enumeration (CWE) and, to study their criticality, mapped them to CVEs. We used ten different LLMs for code generation and analyzed the outputs through static analysis. The amount of CWEs present in AI-generated code is concerning. Our findings highlight the need for developers to be cautious when using LLM-generated code. This study provides valuable insights to advance automated code generation and encourage further research in this domain.
☆ Synthesizing Visual Concepts as Vision-Language Programs
Vision-Language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning tasks, leading to inconsistent or illogical outputs. Neuro-symbolic methods promise to address this by inducing interpretable logical rules, though they exploit rigid, domain-specific perception modules. We propose Vision-Language Programs (VLP), which combine the perceptual flexibility of VLMs with systematic reasoning of program synthesis. Rather than embedding reasoning inside the VLM, VLP leverages the model to produce structured visual descriptions that are compiled into neuro-symbolic programs. The resulting programs execute directly on images, remain consistent with task constraints, and provide human-interpretable explanations that enable easy shortcut mitigation. Experiments on synthetic and real-world datasets demonstrate that VLPs outperform direct and structured prompting, particularly on tasks requiring complex logical reasoning.
☆ Learning to Compress Graphs via Dual Agents for Consistent Topological Robustness Evaluation
As graph-structured data grow increasingly large, evaluating their robustness under adversarial attacks becomes computationally expensive and difficult to scale. To address this challenge, we propose to compress graphs into compact representations that preserve both topological structure and robustness profile, enabling efficient and reliable evaluation.We propose Cutter, a dual-agent reinforcement learning framework composed of a Vital Detection Agent (VDA) and a Redundancy Detection Agent (RDA), which collaboratively identify structurally vital and redundant nodes for guided compression. Cutter incorporates three key strategies to enhance learning efficiency and compression quality: trajectory-level reward shaping to transform sparse trajectory returns into dense, policy-equivalent learning signals; prototype-based shaping to guide decisions using behavioral patterns from both highand low-return trajectories; and cross-agent imitation to enable safer and more transferable exploration. Experiments on multiple real-world graphs demonstrate that Cutter generates compressed graphs that retain essential static topological properties and exhibit robustness degradation trends highly consistent with the original graphs under various attack scenarios, thereby significantly improving evaluation efficiency without compromising assessment fidelity.
☆ Active Inference is a Subtype of Variational Inference
Automated decision-making under uncertainty requires balancing exploitation and exploration. Classical methods treat these separately using heuristics, while Active Inference unifies them through Expected Free Energy (EFE) minimization. However, EFE minimization is computationally expensive, limiting scalability. We build on recent theory recasting EFE minimization as variational inference, formally unifying it with Planning-as-Inference and showing the epistemic drive as a unique entropic contribution. Our main contribution is a novel message-passing scheme for this unified objective, enabling scalable Active Inference in factored-state MDPs and overcoming high-dimensional planning intractability.
comment: Accepted to the EIML Workshop 2025 at EurIPS (non-archival)
☆ SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
Large Language Models (LLMs) face a significant bottleneck during autoregressive inference due to the massive memory footprint of the Key-Value (KV) cache. Existing compression techniques like token eviction, quantization, or other low-rank methods often risk information loss, have fixed limits, or introduce significant computational overhead from explicit decompression steps. In this work, we introduce SWAN, a novel, fine-tuning-free framework that eliminates this overhead. Our method uses an offline orthogonal matrix to rotate and prune the KV-cache, which is then used directly in the attention computation without any reconstruction. Our extensive experiments demonstrate that SWAN, augmented with a small dense buffer, offers a robust trade-off, maintaining performance close to the uncompressed baseline even at aggressive 50-60% memory savings per-token on KV-cache. A key advantage is its runtime-tunable compression level, allowing operators to dynamically adjust the memory footprint, a flexibility absent in methods requiring fixed offline configurations. This combination of a decompression-free design, high performance under compression, and adaptability makes SWAN a practical and efficient solution for serving LLMs with long contexts.
☆ Skeletons Matter: Dynamic Data Augmentation for Text-to-Query EMNLP 2025
The task of translating natural language questions into query languages has long been a central focus in semantic parsing. Recent advancements in Large Language Models (LLMs) have significantly accelerated progress in this field. However, existing studies typically focus on a single query language, resulting in methods with limited generalizability across different languages. In this paper, we formally define the Text-to-Query task paradigm, unifying semantic parsing tasks across various query languages. We identify query skeletons as a shared optimization target of Text-to-Query tasks, and propose a general dynamic data augmentation framework that explicitly diagnoses model-specific weaknesses in handling these skeletons to synthesize targeted training data. Experiments on four Text-to-Query benchmarks demonstrate that our method achieves state-of-the-art performance using only a small amount of synthesized data, highlighting the efficiency and generality of our approach and laying a solid foundation for unified research on Text-to-Query tasks. We release our code at https://github.com/jjjycaptain/Skeletron.
comment: Accepted at EMNLP 2025
☆ Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations NeurIPS 2024
Large Language Models (LLMs) remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior. This work presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions, followed by three proposed defense strategies. First, a Prompt-Level Defense Framework detects and neutralizes adversarial inputs through sanitization, paraphrasing, and adaptive system guarding. Second, a Logit-Based Steering Defense reinforces refusal behavior through inference-time vector steering in safety-sensitive layers. Third, a Domain-Specific Agent Defense employs the MetaGPT framework to enforce structured, role-based collaboration and domain adherence. Experiments on benchmark datasets show substantial reductions in attack success rate, achieving full mitigation under the agent-based defense. Overall, this study highlights how jailbreaks pose a significant security threat to LLMs and identifies key intervention points for prevention, while noting that defense strategies often involve trade-offs between safety, performance, and scalability. Code is available at: https://github.com/Kuro0911/CS5446-Project
comment: 20 pages including appendix; technical report; NeurIPS 2024 style
☆ Look It Up: Analysing Internal Web Search Capabilities of Modern LLMs
Modern large language models integrate web search to provide real-time answers, yet it remains unclear whether they are efficiently calibrated to use search when it is actually needed. We introduce a benchmark evaluating both the necessity and effectiveness of web access across commercial models with no access to internal states or parameters. The dataset includes a static split of 783 temporally anchored questions answerable from pre-cutoff knowledge, aimed at testing whether models invoke search based on low internal confidence, and a dynamic split of 288 post-cutoff queries designed to test whether models recognise when search is required and retrieve updated information. Web access substantially improves static accuracy for GPT-5-mini and Claude Haiku 4.5, though confidence calibration worsens. On dynamic queries, both models frequently invoke search yet remain below 70 percent accuracy due to weak query formulation. Costs per accuracy-improving call remain low, but returns diminish once initial retrieval fails. Selective invocation helps, but models become overconfident and inconsistent after search. Overall, built-in web search meaningfully improves factual accuracy and can be invoked selectively, yet models remain overconfident, skip retrieval when it is essential, and falter once initial search queries underperform. Taken together, internal web search works better as a good low-latency verification layer than a reliable analytical tool, with clear room for improvement.
comment: 10 pages, 8 figures
☆ Learning Solution Operators for Partial Differential Equations via Monte Carlo-Type Approximation NeurIPS 2025
The Monte Carlo-type Neural Operator (MCNO) introduces a lightweight architecture for learning solution operators for parametric PDEs by directly approximating the kernel integral using a Monte Carlo approach. Unlike Fourier Neural Operators, MCNO makes no spectral or translation-invariance assumptions. The kernel is represented as a learnable tensor over a fixed set of randomly sampled points. This design enables generalization across multiple grid resolutions without relying on fixed global basis functions or repeated sampling during training. Experiments on standard 1D PDE benchmarks show that MCNO achieves competitive accuracy with low computational cost, providing a simple and practical alternative to spectral and graph-based neural operators.
comment: NeurIPS 2025 Workshop on Machine Learning and the Physical Sciences
☆ MoodBench 1.0: An Evaluation Benchmark for Emotional Companionship Dialogue Systems
With the rapid development of Large Language Models, dialogue systems are shifting from information tools to emotional companions, heralding the era of Emotional Companionship Dialogue Systems (ECDs) that provide personalized emotional support for users. However, the field lacks clear definitions and systematic evaluation standards for ECDs. To address this, we first propose a definition of ECDs with formal descriptions. Then, based on this theory and the design principle of "Ability Layer-Task Layer (three level)-Data Layer-Method Layer", we design and implement the first ECD evaluation benchmark - MoodBench 1.0. Through extensive evaluations of 30 mainstream models, we demonstrate that MoodBench 1.0 has excellent discriminant validity and can effectively quantify the differences in emotional companionship abilities among models. Furthermore, the results reveal current models' shortcomings in deep emotional companionship, guiding future technological optimization and significantly aiding developers in enhancing ECDs' user experience.
comment: 26 pages, 7 figures
☆ LLM-Driven Kernel Evolution: Automating Driver Updates in Linux
Linux kernel evolution breaks drivers through API/ABI changes, semantic shifts, and security-hardening updates. We introduce DRIVEBENCH, an executable corpus of kernel$\rightarrow$driver co-evolution cases, and AUTODRIVER, a closed-loop, LLM-driven system for automating driver maintenance. The system integrates prompt engineering, multi-agent collaboration, static analysis, and iterative validation to ensure that generated patches are not only syntactically correct but also functionally and semantically consistent with kernel conventions. The corpus spans v5.10-v6.10 with 235 validated cases drawn from 612 candidates. In evaluation across 55 cases, AUTODRIVER achieves 56.4% compilation success; QEMU-based boot verification indicates that compiled patches preserve driver initialization in most instances. By releasing DRIVEBENCH and tooling, we enable reproducible research and a practical route to continuous, safe co-evolution of drivers with the Linux kernel.
☆ Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation
Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of textual visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many to many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit noisy ones. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: inter-group Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores. Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.
☆ How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
☆ VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal RL
Group-based policy optimization methods like GRPO and GSPO have become standard for training multimodal models, leveraging group-wise rollouts and relative advantage estimation. However, they suffer from a critical \emph{gradient vanishing} problem when all responses within a group receive identical rewards, causing advantage estimates to collapse and training signals to diminish. Existing attempts to mitigate this issue fall into two paradigms: filtering-based and sampling-based methods. Filtering-based methods first generate rollouts broadly and then retroactively filter out uninformative groups, leading to substantial computational overhead. Sampling-based methods proactively select effective samples before rollout but rely on static criteria or prior dataset knowledge, lacking real-time adaptability. To address these issues, we propose \textbf{VADE}, a \textbf{V}ariance-\textbf{A}ware \textbf{D}ynamic sampling framework via online sample-level difficulty \textbf{E}stimation. Our framework integrates three key components: online sample-level difficulty estimation using Beta distributions, a Thompson sampler that maximizes information gain through the estimated correctness probability, and a two-scale prior decay mechanism that maintains robust estimation under policy evolution. This three components design enables VADE to dynamically select the most informative samples, thereby amplifying training signals while eliminating extra rollout costs. Extensive experiments on multimodal reasoning benchmarks show that VADE consistently outperforms strong baselines in both performance and sample efficiency, while achieving a dramatic reduction in computational overhead. More importantly, our framework can serves as a plug-and-play component to be seamlessly integrated into existing group-based RL algorithms. Code and models are available at https://VADE-RL.github.io.
☆ MetaDCSeg: Robust Medical Image Segmentation via Meta Dynamic Center Weighting
Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomical boundaries, which lead to instability in model training. Existing methods typically rely on global noise assumptions or confidence-based sample selection, which inadequately mitigate the performance degradation caused by annotation noise, especially in challenging boundary regions. To address this issue, we propose MetaDCSeg, a robust framework that dynamically learns optimal pixel-wise weights to suppress the influence of noisy ground-truth labels while preserving reliable annotations. By explicitly modeling boundary uncertainty through a Dynamic Center Distance (DCD) mechanism, our approach utilizes weighted feature distances for foreground, background, and boundary centers, directing the model's attention toward hard-to-segment pixels near ambiguous boundaries. This strategy enables more precise handling of structural boundaries, which are often overlooked by existing methods, and significantly enhances segmentation performance. Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg consistently outperforms existing state-of-the-art methods.
☆ Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models NeurIPS 2025
Efficient deployment of small language models (SLMs) is essential for numerous real-world applications with stringent latency constraints. While previous work on SLM design has primarily focused on reducing the number of parameters to achieve parameter-optimal SLMs, parameter efficiency does not necessarily translate into proportional real-device speed-ups. This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training when real-device latency is the primary consideration. Specifically, we identify two central architectural factors: depth-width ratios and operator choices. The former is crucial for small-batch-size latency, while the latter affects both latency and large-batch-size throughput. In light of this, we first study latency-optimal depth-width ratios, with the key finding that although deep-thin models generally achieve better accuracy under the same parameter budget, they may not lie on the accuracy-latency trade-off frontier. Next, we explore emerging efficient attention alternatives to evaluate their potential as candidate building operators. Using the identified promising operators, we construct an evolutionary search framework to automatically discover latency-optimal combinations of these operators within hybrid SLMs, thereby advancing the accuracy-latency frontier. In addition to architectural improvements, we further enhance SLM training using a weight normalization technique that enables more effective weight updates and improves final convergence. Combining these methods, we introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs, e.g., achieving over +5.5% average accuracy, 1.3x/1.9x lower latency, and 18.7x/45.6x higher throughput compared to Qwen3-1.7B/0.6B, respectively.
comment: Accepted by NeurIPS 2025
☆ CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation ACL'25
Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose \textbf{CoreEval}, a \textbf{Co}ntamination-\textbf{re}silient \textbf{Eval}uation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is employed to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.
comment: ACL'25
☆ Accelerating Reinforcement Learning via Error-Related Human Brain Signals
In this work, we investigate how implicit neural feed back can accelerate reinforcement learning in complex robotic manipulation settings. While prior electroencephalogram (EEG) guided reinforcement learning studies have primarily focused on navigation or low-dimensional locomotion tasks, we aim to understand whether such neural evaluative signals can improve policy learning in high-dimensional manipulation tasks involving obstacles and precise end-effector control. We integrate error related potentials decoded from offline-trained EEG classifiers into reward shaping and systematically evaluate the impact of human-feedback weighting. Experiments on a 7-DoF manipulator in an obstacle-rich reaching environment show that neural feedback accelerates reinforcement learning and, depending on the human-feedback weighting, can yield task success rates that at times exceed those of sparse-reward baselines. Moreover, when applying the best-performing feedback weighting across all sub jects, we observe consistent acceleration of reinforcement learning relative to the sparse-reward setting. Furthermore, leave-one subject-out evaluations confirm that the proposed framework remains robust despite the intrinsic inter-individual variability in EEG decodability. Our findings demonstrate that EEG-based reinforcement learning can scale beyond locomotion tasks and provide a viable pathway for human-aligned manipulation skill acquisition.
☆ GContextFormer: A global context-aware hybrid multi-head attention approach with scaled additive aggregation for multimodal trajectory prediction
Multimodal trajectory prediction generates multiple plausible future trajectories to address vehicle motion uncertainty from intention ambiguity and execution variability. However, HD map-dependent models suffer from costly data acquisition, delayed updates, and vulnerability to corrupted inputs, causing prediction failures. Map-free approaches lack global context, with pairwise attention over-amplifying straight patterns while suppressing transitional patterns, resulting in motion-intention misalignment. This paper proposes GContextFormer, a plug-and-play encoder-decoder architecture with global context-aware hybrid attention and scaled additive aggregation achieving intention-aligned multimodal prediction without map reliance. The Motion-Aware Encoder builds scene-level intention prior via bounded scaled additive aggregation over mode-embedded trajectory tokens and refines per-mode representations under shared global context, mitigating inter-mode suppression and promoting intention alignment. The Hierarchical Interaction Decoder decomposes social reasoning into dual-pathway cross-attention: a standard pathway ensures uniform geometric coverage over agent-mode pairs while a neighbor-context-enhanced pathway emphasizes salient interactions, with gating module mediating their contributions to maintain coverage-focus balance. Experiments on eight highway-ramp scenarios from TOD-VT dataset show GContextFormer outperforms state-of-the-art baselines. Compared to existing transformer models, GContextFormer achieves greater robustness and concentrated improvements in high-curvature and transition zones via spatial distributions. Interpretability is achieved through motion mode distinctions and neighbor context modulation exposing reasoning attribution. The modular architecture supports extensibility toward cross-domain multimodal reasoning tasks. Source: https://fenghy-chen.github.io/sources/.
☆ Periodic Asynchrony: An Effective Method for Accelerating On-Policy Reinforcement Learning
Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution imposes a computational coupling that prevents concurrent inference and training. In this study, we are returning to the strategy of separating inference and training deployment, and by introducing improvements in the data loader, we transform the conventional synchronous architecture into a periodically asynchronous framework, which allows for demand-driven, independent, and elastic scaling of each component, while the accuracy of the algorithm remains completely equivalent to the synchronization method, with both belonging to the on-policy strategy. It is worth emphasizing that we apply a unified tri-model architecture in the training phase, and we also proposed a shared-prompt attention mask to reduce repetitive computation. In practice, these works have achieved at least a threefold overall performance improvement in RL training on NPU platforms, indicating its potential for widespread application.
☆ Multidimensional Music Aesthetic Evaluation via Semantically Consistent C-Mixup Augmentation
Evaluating the aesthetic quality of generated songs is challenging due to the multi-dimensional nature of musical perception. We propose a robust music aesthetic evaluation framework that combines (1) multi-source multi-scale feature extraction to obtain complementary segment- and track-level representations, (2) a hierarchical audio augmentation strategy to enrich training data, and (3) a hybrid training objective that integrates regression and ranking losses for accurate scoring and reliable top-song identification. Experiments on the ICASSP 2026 SongEval benchmark demonstrate that our approach consistently outperforms baseline methods across correlation and top-tier metrics.
☆ KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit
High quality kernels are critical for reducing training and inference costs of Large Language Models (LLMs), yet they traditionally require significant expertise in hardware architecture and software optimization. While recent advances in LLM-based code generation show promise for complex optimization, existing methods struggle with the vast optimization space due to insufficient hardware domain knowledge, failing to effectively balance exploration and exploitation. We present KernelBand, a novel framework that formulates kernel optimization as a hierarchical multi-armed bandit problem, enabling LLM agents to strategically navigate the optimization space by treating kernel selection and optimization strategy application as sequential decision-making processes. Our approach leverages hardware profiling information to identify promising optimization strategies and employs runtime behavior clustering to reduce exploration overhead across kernel candidates. Extensive experiments on TritonBench demonstrate that KernelBand significantly outperforms state-of-the-art methods, achieving superior performance with fewer tokens while exhibiting consistent improvement without saturation as computational resources increase.
comment: Work in progress
☆ Generating Reading Comprehension Exercises with Large Language Models for Educational Applications
With the rapid development of large language models (LLMs), the applications of LLMs have grown substantially. In the education domain, LLMs demonstrate significant potential, particularly in automatic text generation, which enables the creation of intelligent and adaptive learning content. This paper proposes a new LLMs framework, which is named as Reading Comprehension Exercise Generation (RCEG). It can generate high-quality and personalized English reading comprehension exercises automatically. Firstly, RCEG uses fine-tuned LLMs to generate content candidates. Then, it uses a discriminator to select the best candidate. Finally, the quality of the generated content has been improved greatly. To evaluate the performance of RCEG, a dedicated dataset for English reading comprehension is constructed to perform the experiments, and comprehensive evaluation metrics are used to analyze the experimental results. These metrics include content diversity, factual accuracy, linguistic toxicity, and pedagogical alignment. Experimental results show that RCEG significantly improves the relevance and cognitive appropriateness of the generated exercises.
☆ Deep Hybrid Model for Region of Interest Detection in Omnidirectional Videos
The main goal of the project is to design a new model that predicts regions of interest in 360$^{\circ}$ videos. The region of interest (ROI) plays an important role in 360$^{\circ}$ video streaming. For example, ROIs are used to predict view-ports, intelligently cut the videos for live streaming, etc so that less bandwidth is used. Detecting view-ports in advance helps reduce the movement of the head while streaming and watching a video via the head-mounted device. Whereas, intelligent cuts of the videos help improve the efficiency of streaming the video to users and enhance the quality of their viewing experience. This report illustrates the secondary task to identify ROIs, in which, we design, train, and test a hybrid saliency model. In this work, we refer to saliency regions to represent the regions of interest. The method includes the processes as follows: preprocessing the video to obtain frames, developing a hybrid saliency model for predicting the region of interest, and finally post-processing the output predictions of the hybrid saliency model to obtain the output region of interest for each frame. Then, we compare the performance of the proposed method with the subjective annotations of the 360RAT dataset.
☆ Time Travel: LLM-Assisted Semantic Behavior Localization with Git Bisect SC
We present a novel framework that integrates Large Language Models (LLMs) into the Git bisect process for semantic fault localization. Traditional bisect assumes deterministic predicates and binary failure states assumptions often violated in modern software development due to flaky tests, nonmonotonic regressions, and semantic divergence from upstream repositories. Our system augments bisect traversal with structured chain of thought reasoning, enabling commit by commit analysis under noisy conditions. We evaluate multiple open source and proprietary LLMs for their suitability and fine tune DeepSeekCoderV2 using QLoRA on a curated dataset of semantically labeled diffs. We adopt a weak supervision workflow to reduce annotation overhead, incorporating human in the loop corrections and self consistency filtering. Experiments across multiple open source projects show a 6.4 point absolute gain in success rate from 74.2 to 80.6 percent, leading to significantly fewer failed traversals and by experiment up to 2x reduction in average bisect time. We conclude with discussions on temporal reasoning, prompt design, and finetuning strategies tailored for commit level behavior analysis.
comment: submitted to Git Bisect SCALCOM 2025 Calgary (to be published)
☆ Pre-Filtering Code Suggestions using Developer Behavioral Telemetry to Optimize LLM-Assisted Programming
Large Language Models (LLMs) are increasingly integrated into code editors to provide AI-powered code suggestions. Yet many of these suggestions are ignored, resulting in wasted computation, increased latency, and unnecessary interruptions. We introduce a lightweight pre-filtering model that predicts the likelihood of suggestion acceptance before invoking the LLM, using only real-time developer telemetry such as typing speed, file navigation, and editing activity. Deployed in a production-grade Visual Studio Code plugin over four months of naturalistic use, our approach nearly doubled acceptance rates (18.4% -> 34.2%) while suppressing 35% of low-value LLM calls. These findings demonstrate that behavioral signals alone can meaningfully improve both user experience and system efficiency in LLM-assisted programming, highlighting the value of timing-aware, privacy-preserving adaptation mechanisms. The filter operates solely on pre-invocation editor telemetry and never inspects code or prompts.
comment: \c{opyright} 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
♻ ☆ Cognitive Foundations for Reasoning and Their Manifestation in LLMs
Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. To understand this gap, we synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning & knowledge, and transformation operations. We introduce a fine-grained evaluation framework and conduct the first large-scale empirical analysis of 192K traces from 18 models across text, vision, and audio, complemented by 54 human think-aloud traces, which we make publicly available. We find that models under-utilize cognitive elements correlated with success, narrowing to rigid sequential processing on ill-structured problems where diverse representations and meta-cognitive monitoring are critical. Human traces show more abstraction and conceptual processing, while models default to surface-level enumeration. Meta-analysis of 1.6K LLM reasoning papers reveals the research community concentrates on easily quantifiable elements (sequential organization: 55%, decomposition: 60%) but neglecting meta-cognitive controls (self-awareness: 16%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffold successful structures, improving performance by up to 66.7% on complex problems. By establishing a shared vocabulary between cognitive science and LLM research, our framework enables systematic diagnosis of reasoning failures and principled development of models that reason through robust cognitive mechanisms rather than spurious shortcuts, while providing tools to test theories of human cognition at scale.
comment: 40 pages, 4 tables, 6 figures
♻ ☆ The Loss of Control Playbook: Degrees, Dynamics, and Preparedness
This research report addresses the absence of an actionable definition for Loss of Control (LoC) in AI systems by developing a novel taxonomy and preparedness framework. Despite increasing policy and research attention, existing LoC definitions vary significantly in scope and timeline, hindering effective LoC assessment and mitigation. To address this issue, we draw from an extensive literature review and propose a graded LoC taxonomy, based on the metrics of severity and persistence, that distinguishes between Deviation, Bounded LoC, and Strict LoC. We model pathways toward a societal state of vulnerability in which sufficiently advanced AI systems have acquired or could acquire the means to cause Bounded or Strict LoC once a catalyst, either misalignment or pure malfunction, materializes. We argue that this state becomes increasingly likely over time, absent strategic intervention, and propose a strategy to avoid reaching a state of vulnerability. Rather than focusing solely on intervening on AI capabilities and propensities potentially relevant for LoC or on preventing potential catalysts, we introduce a complementary framework that emphasizes three extrinsic factors: Deployment context, Affordances, and Permissions (the DAP framework). Compared to work on intrinsic factors and catalysts, this framework has the unfair advantage of being actionable today. Finally, we put forward a plan to maintain preparedness and prevent the occurrence of LoC outcomes should a state of societal vulnerability be reached, focusing on governance measures (threat modeling, deployment policies, emergency response) and technical controls (pre-deployment testing, control measures, monitoring) that could maintain a condition of perennial suspension.
♻ ☆ Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models NeurIPS 2025
Robust coordination is critical for effective decision-making in multi-agent systems, especially under partial observability. A central question in Multi-Agent Reinforcement Learning (MARL) is whether to engineer communication protocols or learn them end-to-end. We investigate this dichotomy using embodied world models. We propose and compare two communication strategies for a cooperative task-allocation problem. The first, Learned Direct Communication (LDC), learns a protocol end-to-end. The second, Intention Communication, uses an engineered inductive bias: a compact, learned world model, the Imagined Trajectory Generation Module (ITGM), which uses the agent's own policy to simulate future states. A Message Generation Network (MGN) then compresses this plan into a message. We evaluate these approaches on goal-directed interaction in a grid world, a canonical abstraction for embodied AI problems, while scaling environmental complexity. Our experiments reveal that while emergent communication is viable in simple settings, the engineered, world model-based approach shows superior performance, sample efficiency, and scalability as complexity increases. These findings advocate for integrating structured, predictive models into MARL agents to enable active, goal-driven coordination.
comment: Published in the Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Scaling Environments for Agents (SEA). Additionally accepted for presentation in the NeurIPS 2025 Workshop: Embodied World Models for Decision Making (EWM) and the NeurIPS 2025 Workshop: Optimization for Machine Learning (OPT)
♻ ☆ Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics
Learning robust and generalizable world models is crucial for enabling efficient and scalable robotic control in real-world environments. In this work, we introduce a novel framework for learning world models that accurately capture complex, partially observable, and stochastic dynamics. The proposed method employs a dual-autoregressive mechanism and self-supervised training to achieve reliable long-horizon predictions without relying on domain-specific inductive biases, ensuring adaptability across diverse robotic tasks. We further propose a policy optimization framework that leverages world models for efficient training in imagined environments and seamless deployment in real-world systems. This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.
♻ ☆ Bridging LLM Planning Agents and Formal Methods: A Case Study in Plan Verification
We introduce a novel framework for evaluating the alignment between natural language plans and their expected behavior by converting them into Kripke structures and Linear Temporal Logic (LTL) using Large Language Models (LLMs) and performing model checking. We systematically evaluate this framework on a simplified version of the PlanBench plan verification dataset and report on metrics like Accuracy, Precision, Recall and F1 scores. Our experiments demonstrate that GPT-5 achieves excellent classification performance (F1 score of 96.3%) while almost always producing syntactically perfect formal representations that can act as guarantees. However, the synthesis of semantically perfect formal models remains an area for future exploration.
comment: Accepted to AgenticSE Workshop at ASE 2025
♻ ☆ ALMAS: an Autonomous LLM-based Multi-Agent Software Engineering Framework
Multi-agent Large Language Model (LLM) systems have been leading the way in applied LLM research across a number of fields. One notable area is software development, where researchers have advanced the automation of code implementation, code testing, code maintenance, inter alia, using LLM agents. However, software development is a multifaceted environment that extends beyond just code. As such, a successful LLM system must factor in multiple stages of the software development life-cycle (SDLC). In this paper, we propose a vision for ALMAS, an Autonomous LLM-based Multi-Agent Software Engineering framework, which follows the above SDLC philosophy such that it may work within an agile software development team to perform several tasks end-to-end. ALMAS aligns its agents with agile roles, and can be used in a modular fashion to seamlessly integrate with human developers and their development environment. We showcase the progress towards ALMAS through our published works and a use case demonstrating the framework, where ALMAS is able to seamlessly generate an application and add a new feature.
comment: Accepted to MAS-GAIN Workshop at ASE 2025
♻ ☆ Node Preservation and its Effect on Crossover in Cartesian Genetic Programming
While crossover is a critical and often indispensable component in other forms of Genetic Programming, such as Linear- and Tree-based, it has consistently been claimed that it deteriorates search performance in CGP. As a result, a mutation-alone $(1+λ)$ evolutionary strategy has become the canonical approach for CGP. Although several operators have been developed that demonstrate an increased performance over the canonical method, a general solution to the problem is still lacking. In this paper, we compare basic crossover methods, namely one-point and uniform, to variants in which nodes are ``preserved,'' including the subgraph crossover developed by Roman Kalkreuth, the difference being that when ``node preservation'' is active, crossover is not allowed to break apart instructions. We also compare a node mutation operator to the traditional point mutation; the former simply replaces an entire node with a new one. We find that node preservation in both mutation and crossover improves search using symbolic regression benchmark problems, moving the field towards a general solution to CGP crossover.
comment: Draft to cite in another paper before both papers are peer-reviewed for the evo*2026 conference, 21 pages, 5 figures
♻ ☆ Entropic Time Schedulers for Generative Diffusion Models
The practical performance of generative diffusion models depends on the appropriate choice of the noise scheduling function, which can also be equivalently expressed as a time reparameterization. In this paper, we present a time scheduler that selects sampling points based on entropy rather than uniform time spacing, ensuring that each point contributes an equal amount of information to the final generation. We prove that this time reparameterization does not depend on the initial choice of time. Furthermore, we provide a tractable exact formula to estimate this \emph{entropic time} for a trained model using the training loss without substantial overhead. Alongside the entropic time, inspired by the optimality results, we introduce a rescaled entropic time. In our experiments with mixtures of Gaussian distributions and ImageNet, we show that using the (rescaled) entropic times greatly improves the inference performance of trained models. In particular, we found that the image quality in pretrained EDM2 models, as evaluated by FID and FD-DINO scores, can be substantially increased by the rescaled entropic time reparameterization without increasing the number of function evaluations, with greater improvements in the few NFEs regime. Code is available at https://github.com/DejanStancevic/Entropic-Time-Schedulers-for-Generative-Diffusion-Models.
comment: 31 pages
♻ ☆ The Geometry of Cortical Computation: Manifold Disentanglement and Predictive Dynamics in VCNet NeurIPS 2025
Despite their success, modern convolutional neural networks (CNNs) exhibit fundamental limitations, including data inefficiency, poor out-of-distribution generalization, and vulnerability to adversarial perturbations. These shortcomings can be traced to a lack of inductive biases that reflect the inherent geometric structure of the visual world. The primate visual system, in contrast, demonstrates superior efficiency and robustness, suggesting that its architectural and computational principles,which evolved to internalize these structures,may offer a blueprint for more capable artificial vision. This paper introduces Visual Cortex Network (VCNet), a novel neural network architecture whose design is informed by the macro-scale organization of the primate visual cortex. VCNet is framed as a geometric framework that emulates key biological mechanisms, including hierarchical processing across distinct cortical areas, dual-stream information segregation for learning disentangled representations, and top-down predictive feedback for representation refinement. We interpret these mechanisms through the lens of geometry and dynamical systems, positing that they guide the learning of structured, low-dimensional neural manifolds. We evaluate VCNet on two specialized benchmarks: the Spots-10 animal pattern dataset, which probes sensitivity to natural textures, and a light field image classification task, which requires processing higher-dimensional visual data. Our results show that VCNet achieves state-of-the-art accuracy of 92.1\% on Spots-10 and 74.4\% on the light field dataset, surpassing contemporary models of comparable size. This work demonstrates that integrating high-level neuroscientific principles, viewed through a geometric lens, can lead to more efficient and robust models, providing a promising direction for addressing long-standing challenges in machine learning.
comment: Published in the proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Symmetry and Geometry in Neural Representations (NeurReps). Additionally accepted for presentation in NeurIPS 2025 Workshop: Interpreting Cognition in Deep Learning Models (CogInterp)
♻ ☆ How does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective AAAI 2026
Multilingual Alignment is an effective and representative paradigm to enhance LLMs' multilingual capabilities, which transfers the capabilities from the high-resource languages to the low-resource languages. Meanwhile, some research on language-specific neurons provides a new perspective to analyze and understand LLMs' mechanisms. However, we find that there are many neurons that are shared by multiple but not all languages and cannot be correctly classified. In this work, we propose a ternary classification methodology that categorizes neurons into three types, including language-specific neurons, language-related neurons, and general neurons. And we propose a corresponding identification algorithm to distinguish these different types of neurons. Furthermore, based on the distributional characteristics of different types of neurons, we divide the LLMs' internal process for multilingual inference into four parts: (1) multilingual understanding, (2) shared semantic space reasoning, (3) multilingual output space transformation, and (4) vocabulary space outputting. Additionally, we systematically analyze the models before and after alignment with a focus on different types of neurons. We also analyze the phenomenon of ''Spontaneous Multilingual Alignment''. Overall, our work conducts a comprehensive investigation based on different types of neurons, providing empirical results and valuable insights to better understand multilingual alignment and multilingual capabilities of LLMs.
comment: AAAI 2026 (Oral)
♻ ☆ The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification
Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity - leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multianimal tracking in the wild. The dataset is available at https://www.conservationxlabs.com/sa-fari.
♻ ☆ AI and the Net-Zero Journey: Energy Demand, Emissions, and the Potential for Transition
Thanks to the availability of massive amounts of data, computing resources, and advanced algorithms, AI has entered nearly every sector. This has sparked significant investment and interest, particularly in building data centers with the necessary hardware and software to develop and operate AI models and AI-based workflows. In this technical review article, we present energy consumption scenarios of data centers and impact on GHG emissions, considering both near-term projections (up to 2030) and long-term outlook (2035 and beyond). We address the quintessential question of whether AI will have a net positive, neutral, or negative impact on CO2 emissions by 2035. Additionally, we discuss AI's potential to automate, create efficient and disruptive workflows across various fields related to energy production, supply and consumption. In the near-term scenario, the growing demand for AI will likely strain computing resources, lead to increase in electricity consumption and therefore associated CO2 emissions. This is due to the power-hungry nature of big data centers and the requirements for training and running of large and complex AI models, as well as the penetration of AI assistant search and applications for public use. However, the long-term outlook could be more promising. AI has the potential to be a game-changer in CO2 reduction. Its ability to further automate and optimize processes across industries, from energy production to logistics, could significantly decrease our carbon footprint. This positive impact is anticipated to outweigh the initial emissions bump, creating value for businesses and society in areas where traditional solutions have fallen short. In essence, AI might cause some initial growing pains for the environment, but it has the potential to support climate mitigation efforts.
comment: Technical article to be submitted to Data Centric Engineering Journal
♻ ☆ FOCUS: Efficient Keyframe Selection for Long Video Understanding
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs. Code is available at https://github.com/NUS-HPC-AI-Lab/FOCUS.
♻ ☆ BioDisco: Multi-agent hypothesis generation with dual-mode evidence, iterative feedback and temporal evaluation
Identifying novel hypotheses is essential to scientific research, yet this process risks being overwhelmed by the sheer volume and complexity of available information. Existing automated methods often struggle to generate novel and evidence-grounded hypotheses, lack robust iterative refinement and rarely undergo rigorous temporal evaluation for future discovery potential. To address this, we propose BioDisco, a multi-agent framework that draws upon language model-based reasoning and a dual-mode evidence system (biomedical knowledge graphs and automated literature retrieval) for grounded novelty, integrates an internal scoring and feedback loop for iterative refinement, and validates performance through pioneering temporal and human evaluations and a Bradley-Terry paired comparison model to provide statistically-grounded assessment. Our evaluations demonstrate superior novelty and significance over ablated configurations and generalist biomedical agents. Designed for flexibility and modularity, BioDisco allows seamless integration of custom language models or knowledge graphs, and can be run with just a few lines of code.
comment: 12 pages main content, 31 including appendices. 8 figures
♻ ☆ Word-level Annotation of GDPR Transparency Compliance in Privacy Policies using Large Language Models
Ensuring transparency of data practices related to personal information is a core requirement of the General Data Protection Regulation (GDPR). However, large-scale compliance assessment remains challenging due to the complexity and diversity of privacy policy language. Manual audits are labour-intensive and inconsistent, while current automated methods often lack the granularity required to capture nuanced transparency disclosures. In this paper, we present a modular large language model (LLM)-based pipeline for fine-grained word-level annotation of privacy policies with respect to GDPR transparency requirements. Our approach integrates LLM-driven annotation with passage-level classification, retrieval-augmented generation, and a self-correction mechanism to deliver scalable, context-aware annotations across 21 GDPR-derived transparency requirements. To support empirical evaluation, we compile a corpus of 703,791 English-language privacy policies and generate a ground-truth sample of 200 manually annotated policies based on a comprehensive, GDPR-aligned annotation scheme. We propose a two-tiered evaluation methodology capturing both passage-level classification and span-level annotation quality and conduct a comparative analysis of seven state-of-the-art LLMs on two annotation schemes, including the widely used OPP-115 dataset. The results of our evaluation show that decomposing the annotation task and integrating targeted retrieval and classification components significantly improve annotation accuracy, particularly for well-structured requirements. Our work provides new empirical resources and methodological foundations for advancing automated transparency compliance assessment at scale.
comment: Accepted to Proceedings on Privacy Enhancing Technologies (PoPETs) 1 (2026)
♻ ☆ A Survey of Generative Categories and Techniques in Multimodal Generative Models
Multimodal Generative Models (MGMs) have rapidly evolved beyond text generation, now spanning diverse output modalities including images, music, video, human motion, and 3D objects, by integrating language with other sensory modalities under unified architectures. This survey categorises six primary generative modalities and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities. We analyze key models, architectural trends, and emergent cross-modal synergies, while highlighting transferable techniques and unresolved challenges. Building on a common taxonomy of models and training recipes, we propose a unified evaluation framework centred on faithfulness, compositionality, and robustness, and synthesise evidence from benchmarks and human studies across modalities. We further analyse trustworthiness, safety, and ethical risks, including multimodal bias, privacy leakage, and the misuse of high-fidelity media generation for deepfakes, disinformation, and copyright infringement in music and 3D assets, together with emerging mitigation strategies. Finally, we discuss how architectural trends, evaluation protocols, and governance mechanisms can be co-designed to close current capability and safety gaps, outlining critical paths toward more general-purpose, controllable, and accountable multimodal generative systems.
♻ ☆ WorldLLM: Improving LLMs' world modeling using curiosity-driven theory-making
Large Language Models (LLMs) possess general world knowledge but often struggle to generate precise predictions in structured, domain-specific contexts such as simulations. These limitations arise from their inability to ground their broad, unstructured understanding in specific environments. To address this, we present WorldLLM, a framework that enhances LLM-based world modeling by combining Bayesian inference and autonomous active exploration with reinforcement learning. WorldLLM leverages the in-context learning abilities of LLMs to guide an LLM-based world model's predictions using natural language hypotheses given in its prompt. These hypotheses are iteratively refined through a Bayesian inference framework that leverages a second LLM as the proposal distribution given collected evidence. This evidence is collected using a curiosity-driven reinforcement learning policy that explores the environment to find transitions with a low log-likelihood under our LLM-based predictive model using the current hypotheses. By alternating between refining hypotheses and collecting new evidence, our framework autonomously drives continual improvement of the predictions. Our experiments demonstrate the effectiveness of WorldLLM in a textual game environment that requires agents to manipulate and combine objects. The framework not only enhances predictive accuracy, but also generates human-interpretable theories of environment dynamics.
♻ ☆ VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
Prevailing joint prediction transformers for Video Highlight Detection and Moment Retrieval (HD/MR) exhibit deficiencies in handling cross-task dynamics, achieving robust video-text alignment, and utilizing effective attention mechanisms, with the potential of Large Language/Vision-Language Models (LLMs/LVLMs) being largely untapped. This paper introduces VideoLights, a novel HD/MR framework addressing these limitations by incorporating: (i) Convolutional Projection and Feature Refinement modules with an alignment loss for enhanced video-text feature congruity; (ii) a Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware representations; (iii) a Uni-directional joint-task feedback mechanism for synergistic task improvement; (iv) hard positive/negative losses for adaptive learning; and (v) the leveraging of LVLMs (e.g., BLIP-2) for superior multimodal feature integration and intelligent pre-training with synthetic data. Comprehensive evaluations on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that VideoLights significantly surpasses existing baselines, establishing new state-of-the-art performances. Codes and model checkpoints are available at https://github.com/dpaul06/VideoLights .
♻ ☆ Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?
Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gödel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that LIVE-SWE-AGENT can achieve an impressive solve rate of 77.4% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
♻ ☆ Automatic Multi-View X-Ray/CT Registration Using Bone Substructure Contours
Purpose: Accurate intraoperative X-ray/CT registration is essential for surgical navigation in orthopedic procedures. However, existing methods struggle with consistently achieving sub-millimeter accuracy, robustness under broad initial pose estimates or need manual key-point annotations. This work aims to address these challenges by proposing a novel multi-view X-ray/CT registration method for intraoperative bone registration. Methods: The proposed registration method consists of a multi-view, contour-based iterative closest point (ICP) optimization. Unlike previous methods, which attempt to match bone contours across the entire silhouette in both imaging modalities, we focus on matching specific subcategories of contours corresponding to bone substructures. This leads to reduced ambiguity in the ICP matches, resulting in a more robust and accurate registration solution. This approach requires only two X-ray images and operates fully automatically. Additionally, we contribute a dataset of 5 cadaveric specimens, including real X-ray images, X-ray image poses and the corresponding CT scans. Results: The proposed registration method is evaluated on real X-ray images using mean reprojection error (mRPD). The method consistently achieves sub-millimeter accuracy with a mRPD 0.67mm compared to 5.35mm by a commercial solution requiring manual intervention. Furthermore, the method offers improved practical applicability, being fully automatic. Conclusion: Our method offers a practical, accurate, and efficient solution for multi-view X-ray/CT registration in orthopedic surgeries, which can be easily combined with tracking systems. By improving registration accuracy and minimizing manual intervention, it enhances intraoperative navigation, contributing to more accurate and effective surgical outcomes in computer-assisted surgery (CAS).
comment: This paper was accepted to IPCAI 2025. The Project Webpage is: https://rflepp.github.io/BoneSubstructureContours2D3DRegistration/
♻ ☆ Distributionally Robust Free Energy Principle for Decision-Making
Despite their groundbreaking performance, autonomous agents can misbehave when training and environmental conditions become inconsistent, with minor mismatches leading to undesirable behaviors or even catastrophic failures. Robustness towards these training-environment ambiguities is a core requirement for intelligent agents and its fulfillment is a long-standing challenge towards their real-world deployments. Here, we introduce a Distributionally Robust Free Energy model (DR-FREE) that instills this core property by design. Combining a robust extension of the free energy principle with a resolution engine, DR-FREE wires robustness into the agent decision-making mechanisms. Across benchmark experiments, DR-FREE enables the agents to complete the task even when, in contrast, state-of-the-art models fail. This milestone may inspire both deployments in multi-agent settings and, at a perhaps deeper level, the quest for an explanation of how natural agents -- with little or no training -- survive in capricious environments.
comment: Contains main text and supplementary information. Supplementary movie is at the paper repository
♻ ☆ Learning to Call: A Field Trial of a Collaborative Bandit Algorithm for Improved Message Delivery in Mobile Maternal Health
Mobile health (mHealth) programs utilize automated voice messages to deliver health information, particularly targeting underserved communities, demonstrating the effectiveness of using mobile technology to disseminate crucial health information to these populations, improving health outcomes through increased awareness and behavioral change. India's Kilkari program delivers vital maternal health information via weekly voice calls to millions of mothers. However, the current random call scheduling often results in missed calls and reduced message delivery. This study presents a field trial of a collaborative bandit algorithm designed to optimize call timing by learning individual mothers' preferred call times. We deployed the algorithm with around $6500$ Kilkari participants as a pilot study, comparing its performance to the baseline random calling approach. Our results demonstrate a statistically significant improvement in call pick-up rates with the bandit algorithm, indicating its potential to enhance message delivery and impact millions of mothers across India. This research highlights the efficacy of personalized scheduling in mobile health interventions and underscores the potential of machine learning to improve maternal health outreach at scale.
♻ ☆ Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions
The robustness of deep neural networks is a crucial factor in safety-critical applications, particularly in complex and dynamic environments (e.g., medical or driving scenarios) where localized corruptions can arise. While previous studies have evaluated the robustness of semantic segmentation (SS) models under whole-image natural or adversarial corruptions, a comprehensive investigation into the spatial robustness of dense vision models under localized corruptions remains underexplored. This paper fills this gap by introducing novel, region-aware metrics for benchmarking the spatial robustness of segmentation models, along with an evaluation framework to assess the impact of natural localized corruptions. Furthermore, it uncovers the inherent complexity of evaluating worst-case spatial robustness using only a single localized adversarial attack. To address this, the work proposes a region-aware multi-attack adversarial analysis to systematically assess model robustness across specific image regions. The proposed metrics and analysis were exploited to evaluate 14 segmentation models in driving scenarios, uncovering key insights into the effects of localized corruption in both natural and adversarial forms. The results reveal that models respond to these two types of threats differently; for instance, transformer-based segmentation models demonstrate notable robustness to localized natural corruptions but are highly vulnerable to adversarial ones, and vice versa for CNN-based models. Consequently, we also address the challenge of balancing robustness to both natural and adversarial localized corruptions by means of ensemble models, thereby achieving a broader threat coverage and improved reliability for dense vision tasks.
comment: Accepted for publication in Pattern Recognition
♻ ☆ In-Situ Tweedie Discrete Diffusion Models
While diffusion models excel at generating continuous data such as images, adapting them to discrete tasks has relied on indirect approaches that either operate in continuous embedding spaces or use token masking mechanisms, both of which deviate from modeling the true discrete data distribution that can be theoretically guaranteed by Tweedie's formula. We propose in-situ Tweedie Discrete Diffusion (TDD), a framework that performs diffusion guaranteed by Tweedie's formula directly within the discrete one-hot space, hence "in-situ." Unlike prior methods that diffuse continuous embeddings or mask tokens, TDD directly corrupts one-hot vectors with Gaussian noise and performs iterative denoising through a timestep-conditioned cross-entropy objective rather than mean-squared-error reconstruction. At each denoising step, the model predicts class probabilities, applies argmax to obtain discrete predictions, converts them to one-hot vectors, and feeds them into the next iteration with progressively reduced noise. This process naturally unifies discriminative classification and generative modeling under a single framework. Experiments demonstrate that TDD achieves strong performance on both image classification and text generation tasks, with extensive ablation studies confirming the effectiveness of each design component. Our work establishes a principled approach to discrete diffusion that preserves the core characteristics of diffusion models while operating natively in discrete space.
♻ ☆ AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
Recent studies have shown that large language models (LLMs), especially smaller ones, often lack robustness in grade school math (GSM) reasoning. In particular, they tend to experience performance drops when faced with distribution shifts, such as changes to numerical or nominal variables, or insertions of distracting clauses. A possible strategy to address this involves generating synthetic data to further "instantiate" reasoning problems on potential variations. In this work, we instead focuses on the strategy of "abstracting" reasoning problems. This not only helps counteract distribution shifts but also facilitates the connection to symbolic tools for deriving solutions. Focusing on GSM, we find that this abstraction process is better acquired through reinforcement learning (RL) than just supervised fine-tuning, which often fails to produce faithful abstractions. Our method, AbstRaL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks. Besides, improving GSM robustness via AbstRaL is shown to also implicitly benefit LLMs' capabilities on OOD mathematical and general reasoning tasks, indicating that abstract thinking broadly enables better generalizability.
comment: Under review
♻ ☆ ReefNet: A Large scale, Taxonomically Enriched Dataset and Benchmark for Hard Coral Classification
Coral reefs are rapidly declining due to anthropogenic pressures such as climate change, underscoring the urgent need for scalable, automated monitoring. We introduce ReefNet, a large public coral reef image dataset with point-label annotations mapped to the World Register of Marine Species (WoRMS). ReefNet aggregates imagery from 76 curated CoralNet sources and an additional site from Al Wajh in the Red Sea, totaling approximately 925000 genus-level hard coral annotations with expert-verified labels. Unlike prior datasets, which are often limited by size, geography, or coarse labels and are not ML-ready, ReefNet offers fine-grained, taxonomically mapped labels at a global scale to WoRMS. We propose two evaluation settings: (i) a within-source benchmark that partitions each source's images for localized evaluation, and (ii) a cross-source benchmark that withholds entire sources to test domain generalization. We analyze both supervised and zero-shot classification performance on ReefNet and find that while supervised within-source performance is promising, supervised performance drops sharply across domains, and performance is low across the board for zero-shot models, especially for rare and visually similar genera. This provides a challenging benchmark intended to catalyze advances in domain generalization and fine-grained coral classification. We will release our dataset, benchmarking code, and pretrained models to advance robust, domain-adaptive, global coral reef monitoring and conservation.
♻ ☆ Can LLM-based Financial Investing Strategies Outperform the Market in Long Run? KDD 2026
Large Language Models (LLMs) have recently been leveraged for asset pricing tasks and stock trading applications, enabling AI agents to generate investment decisions from unstructured financial data. However, most evaluations of LLM timing-based investing strategies are conducted on narrow timeframes and limited stock universes, overstating effectiveness due to survivorship and data-snooping biases. We critically assess their generalizability and robustness by proposing FINSABER, a backtesting framework evaluating timing-based strategies across longer periods and a larger universe of symbols. Systematic backtests over two decades and 100+ symbols reveal that previously reported LLM advantages deteriorate significantly under broader cross-section and over a longer-term evaluation. Our market regime analysis further demonstrates that LLM strategies are overly conservative in bull markets, underperforming passive benchmarks, and overly aggressive in bear markets, incurring heavy losses. These findings highlight the need to develop LLM strategies that are able to prioritise trend detection and regime-aware risk controls over mere scaling of framework complexity.
comment: Accepted to KDD 2026, Datasets & Benchmarks Track
♻ ☆ MoveGPT: Scaling Mobility Foundation Models with Spatially-Aware Mixture of Experts
The success of foundation models in language has inspired a new wave of general-purpose models for human mobility. However, existing approaches struggle to scale effectively due to two fundamental limitations: a failure to use meaningful basic units to represent movement, and an inability to capture the vast diversity of patterns found in large-scale data. In this work, we develop MoveGPT, a large-scale foundation model specifically architected to overcome these barriers. MoveGPT is built upon two key innovations: (1) a unified location encoder that maps geographically disjoint locations into a shared semantic space, enabling pre-training on a global scale; and (2) a Spatially-Aware Mixture-of-Experts Transformer that develops specialized experts to efficiently capture diverse mobility patterns. Pre-trained on billion-scale datasets, MoveGPT establishes a new state-of-the-art across a wide range of downstream tasks, achieving performance gains of up to 35% on average. It also demonstrates strong generalization capabilities to unseen cities. Crucially, our work provides empirical evidence of scaling ability in human mobility, validating a clear path toward building increasingly capable foundation models in this domain.
♻ ☆ Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI
AI Video Chat emerges as a new paradigm for Real-time Communication (RTC), where one peer is not a human, but a Multimodal Large Language Model (MLLM). This makes interaction between humans and AI more intuitive, as if chatting face-to-face with a real person. However, this poses significant challenges to latency, because the MLLM inference takes up most of the response time, leaving very little time for video streaming. Due to network uncertainty, transmission latency becomes a critical bottleneck preventing AI from being like a real person. To address this, we call for AI-oriented RTC research, exploring the network requirement shift from "humans watching video" to "AI understanding video". We begin by recognizing the main differences between AI Video Chat and traditional RTC. Then, through prototype measurements, we identify that ultra-low bitrate is a key factor for low latency. To reduce bitrate dramatically while maintaining MLLM accuracy, we propose Context-Aware Video Streaming that recognizes the importance of each video region for chat and allocates bitrate almost exclusively to chat-important regions. To evaluate the impact of video streaming quality on MLLM accuracy, we build the first benchmark, named Degraded Video Understanding Benchmark (DeViBench). Finally, we discuss some open questions and ongoing solutions for AI Video Chat. DeViBench is open-sourced at: https://github.com/pku-netvideo/DeViBench.
comment: 9 pages, 10 figures, Proceedings of the 24th ACM Workshop on Hot Topics in Networks (HotNets 2025), College Park, Maryland, USA
♻ ☆ Inferring response times of perceptual decisions with Poisson variational autoencoders NeurIPS 2025
Many properties of perceptual decision making are well-modeled by deep neural networks. However, such architectures typically treat decisions as instantaneous readouts, overlooking the temporal dynamics of the decision process. We present an image-computable model of perceptual decision making in which choices and response times arise from efficient sensory encoding and Bayesian decoding of neural spiking activity. We use a Poisson variational autoencoder to learn unsupervised representations of visual stimuli in a population of rate-coded neurons, modeled as independent homogeneous Poisson processes. A task-optimized decoder then continually infers an approximate posterior over actions conditioned on incoming spiking activity. Combining these components with an entropy-based stopping rule yields a principled and image-computable model of perceptual decisions capable of generating trial-by-trial patterns of choices and response times. Applied to MNIST digit classification, the model reproduces key empirical signatures of perceptual decision making, including stochastic variability, right-skewed response time distributions, logarithmic scaling of response times with the number of alternatives (Hick's law), and speed-accuracy trade-offs.
comment: To appear at the NeurIPS 2025 Workshop on Data on the Brain \& Mind
♻ ☆ PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization
Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel generation, yet most existing systems rely solely on correctness or execution time feedback, lacking the ability to reason about low-level performance bottlenecks. In this paper, we introduce PRAGMA, a profile-guided AI kernel generation framework that integrates execution feedback and fine-grained hardware profiling into the reasoning loop. PRAGMA enables LLMs to identify performance bottlenecks, preserve historical best versions, and iteratively refine code quality. We evaluate PRAGMA on KernelBench, covering GPU and CPU backends. Results show that PRAGMA consistently outperforms baseline AIKG without profiling enabled and achieves 2.81$\times$ and 2.30$\times$ averaged speedups against Torch on CPU and GPU platforms, respectively.
♻ ☆ Differentiated Directional Intervention A Framework for Evading LLM Safety Alignment AAAI-26
Safety alignment instills in Large Language Models (LLMs) a critical capacity to refuse malicious requests. Prior works have modeled this refusal mechanism as a single linear direction in the activation space. We posit that this is an oversimplification that conflates two functionally distinct neural processes: the detection of harm and the execution of a refusal. In this work, we deconstruct this single representation into a Harm Detection Direction and a Refusal Execution Direction. Leveraging this fine-grained model, we introduce Differentiated Bi-Directional Intervention (DBDI), a new white-box framework that precisely neutralizes the safety alignment at critical layer. DBDI applies adaptive projection nullification to the refusal execution direction while suppressing the harm detection direction via direct steering. Extensive experiments demonstrate that DBDI outperforms prominent jailbreaking methods, achieving up to a 97.88\% attack success rate on models such as Llama-2. By providing a more granular and mechanistic framework, our work offers a new direction for the in-depth understanding of LLM safety alignment.
comment: AAAI-26-AIA
♻ ☆ Warm Chat: Diffuse Emotion-aware Interactive Talking Head Avatar with Tree-Structured Guidance
Generative models have advanced rapidly, enabling impressive talking head generation that brings AI to life. However, most existing methods focus solely on one-way portrait animation. Even the few that support bidirectional conversational interactions lack precise emotion-adaptive capabilities, significantly limiting their practical applicability. In this paper, we propose Warm Chat, a novel emotion-aware talking head generation framework for dyadic interactions. Leveraging the dialogue generation capability of large language models (LLMs, e.g., GPT-4), our method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states. Specifically, we design a Transformer-based head mask generator that learns temporally consistent motion features in a latent mask space, capable of generating arbitrary-length, temporally consistent mask sequences to constrain head motions. Furthermore, we introduce an interactive talking tree structure to represent dialogue state transitions, where each tree node contains information such as child/parent/sibling nodes and the current character's emotional state. By performing reverse-level traversal, we extract rich historical emotional cues from the current node to guide expression synthesis. Extensive experiments demonstrate the superior performance and effectiveness of our method.
comment: The submission is withdrawn at the request of the authors due to internal reasons within the research team
♻ ☆ Causally Reliable Concept Bottleneck Models NeurIPS 2025
Concept-based models are an emerging paradigm in deep learning that constrains the inference process to operate through human-interpretable variables, facilitating explainability and human interaction. However, these architectures, on par with popular opaque neural models, fail to account for the true causal mechanisms underlying the target phenomena represented in the data. This hampers their ability to support causal reasoning tasks, limits out-of-distribution generalization, and hinders the implementation of fairness constraints. To overcome these issues, we propose Causally reliable Concept Bottleneck Models (C$^2$BMs), a class of concept-based architectures that enforce reasoning through a bottleneck of concepts structured according to a model of the real-world causal mechanisms. We also introduce a pipeline to automatically learn this structure from observational data and unstructured background knowledge (e.g., scientific literature). Experimental evidence suggests that C$^2$BMs are more interpretable, causally reliable, and improve responsiveness to interventions w.r.t. standard opaque and concept-based models, while maintaining their accuracy.
comment: Accepted at NeurIPS 2025
♻ ☆ Teacher Encoder-Student Decoder Denoising Guided Segmentation Network for Anomaly Detection
Visual anomaly detection is a highly challenging task, often categorized as a one-class classification and segmentation problem. Recent studies have demonstrated that the student-teacher (S-T) framework effectively addresses this challenge. However, most S-T frameworks rely solely on pre-trained teacher networks to guide student networks in learning multi-scale similar features, overlooking the potential of the student networks to enhance learning through multi-scale feature fusion. In this study, we propose a novel model named PFADSeg, which integrates a pre-trained teacher network, a denoising student network with multi-scale feature fusion, and a guided anomaly segmentation network into a unified framework. By adopting a unique teacher-encoder and student-decoder denoising mode, the model improves the student network's ability to learn from teacher network features. Furthermore, an adaptive feature fusion mechanism is introduced to train a self-supervised segmentation network that synthesizes anomaly masks autonomously, significantly increasing detection performance. Rigorous evaluations on the widely-used MVTec AD dataset demonstrate that PFADSeg exhibits excellent performance, achieving an image-level AUC of 98.9%, a pixel-level mean precision of 76.4%, and an instance-level mean precision of 78.7%.
♻ ☆ Q-SAM2: Accurate Quantization for Segment Anything Model 2
The Segment Anything Model 2 (SAM2) is a powerful foundation model for promptable segmentation. However, its high computational and memory costs are a major barrier to deployment on resource-constrained devices. In this paper, we present Q-SAM2, an accurate low-bit quantization method that achieves high compression and high fidelity. To address performance degradation arising from challenging weight and activation distributions during quantization, Q-SAM2 introduces two novel contributions: Variance-Reduced Calibration (VRC), an initialization method that reduces weight statistical variance by minimizing the Frobenius norm over a small calibration batch; and Learnable Statistical Clipping (LSC), a Quantization-Aware Training (QAT) method that learns momentum-stabilized clipping factors to manage outliers in weights and activations. Comprehensive experiments demonstrate that Q-SAM2 achieves highly accurate inference with substantial efficiency gains, significantly surpassing state-of-the-art general QAT schemes, particularly in the ultra-low 2-bit regime. Specifically, Q-SAM2 achieves an accuracy gain of up to 9.7 ppt in J&F on the video segmentation benchmark and 7.3 ppt in mIoU for instance segmentation over the best competing QAT model, all while achieving an 8x reduction in model size compared to the BF16 baseline.
comment: 22 pages
♻ ☆ M2R2: MultiModal Robotic Representation for Temporal Action Segmentation
Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast, computer vision typically relies on exteroceptive sensors, such as cameras. Existing multimodal TAS models in robotics integrate feature fusion within the model, making it difficult to reuse learned features across different models. Meanwhile, pretrained vision-only feature extractors commonly used in computer vision struggle in scenarios with limited object visibility. In this work, we address these challenges by proposing M2R2, a multimodal feature extractor tailored for TAS, which combines information from both proprioceptive and exteroceptive sensors. We introduce a novel pretraining strategy that enables the reuse of learned features across multiple TAS models. Our method achieves state-of-the-art performance on the REASSEMBLE dataset, a challenging multimodal robotic assembly dataset, outperforming existing robotic action segmentation models by 46.6%. Additionally, we conduct an extensive ablation study to evaluate the contribution of different modalities in robotic TAS tasks.
comment: 8 pages, 6 figures, 2 tables
♻ ☆ Preventing Shortcut Learning in Medical Image Analysis through Intermediate Layer Knowledge Distillation from Specialist Teachers
Deep learning models are prone to learning shortcut solutions to problems using spuriously correlated yet irrelevant features of their training data. In high-risk applications such as medical image analysis, this phenomenon may prevent models from using clinically meaningful features when making predictions, potentially leading to poor robustness and harm to patients. We demonstrate that different types of shortcuts (those that are diffuse and spread throughout the image, as well as those that are localized to specific areas) manifest distinctly across network layers and can, therefore, be more effectively targeted through mitigation strategies that target the intermediate layers. We propose a novel knowledge distillation framework that leverages a teacher network fine-tuned on a small subset of task-relevant data to mitigate shortcut learning in a student network trained on a large dataset corrupted with a bias feature. Through extensive experiments on CheXpert, ISIC 2017, and SimBA datasets using various architectures (ResNet-18, AlexNet, DenseNet-121, and 3D CNNs), we demonstrate consistent improvements over traditional Empirical Risk Minimization, augmentation-based bias-mitigation, and group-based bias-mitigation approaches. In many cases, we achieve comparable performance with a baseline model trained on bias-free data, even on out-of-distribution test data. Our results demonstrate the practical applicability of our approach to real-world medical imaging scenarios where bias annotations are limited and shortcut features are difficult to identify a priori.
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:020
♻ ☆ Agent-OM: Leveraging LLM Agents for Ontology Matching
Ontology matching (OM) enables semantic interoperability between different ontologies and resolves their conceptual heterogeneity by aligning related entities. OM systems currently have two prevailing design paradigms: conventional knowledge-based expert systems and newer machine learning-based predictive systems. While large language models (LLMs) and LLM agents have revolutionised data engineering and have been applied creatively in many domains, their potential for OM remains underexplored. This study introduces a novel agent-powered LLM-based design paradigm for OM systems. With consideration of several specific challenges in leveraging LLM agents for OM, we propose a generic framework, namely Agent-OM (Agent for Ontology Matching), consisting of two Siamese agents for retrieval and matching, with a set of OM tools. Our framework is implemented in a proof-of-concept system. Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks over state-of-the-art OM systems show that our system can achieve results very close to the long-standing best performance on simple OM tasks and can significantly improve the performance on complex and few-shot OM tasks.
comment: 31 pages
♻ ☆ OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution
Denoising Diffusion Probabilistic Models (DDPMs) show promising potential in one-step Real-World Image Super-Resolution (Real-ISR). Current one-step Real-ISR methods typically inject the low-quality (LQ) image latent representation at the start or end timestep of the DDPM scheduler. Recent studies have begun to note that the LQ image latent and the pre-trained noisy latent representations are intuitively closer at a mid-timestep. However, a quantitative analysis of these latent representations remains lacking. Considering these latent representations can be decomposed into signal and noise, we propose a method based on the Signal-to-Noise Ratio (SNR) to pre-compute an average optimal mid-timestep for injection. To better approximate the pre-trained noisy latent representation, we further introduce the Latent Representation Refinement (LRR) loss via a LoRA-enhanced VAE encoder. We also fine-tune the backbone of the DDPM-based generative model using LoRA to perform one-step denoising at the average optimal mid-timestep. Based on these components, we present OMGSR, a GAN-based Real-ISR framework that employs a DDPM-based generative model as the generator and a DINOv3-ConvNeXt model with multi-level discriminator heads as the discriminator. We also propose the DINOv3-ConvNeXt DISTS (Dv3CD) loss, which is enhanced for structural perception at varying resolutions. Within the OMGSR framework, we develop OMGSR-S based on SD2.1-base. An ablation study confirms that our pre-computation strategy and LRR loss significantly improve the baseline. Comparative studies demonstrate that OMGSR-S achieves state-of-the-art performance across multiple metrics. Code is available at \hyperlink{Github}{https://github.com/wuer5/OMGSR}.
♻ ☆ Developing an Algorithm Selector for Green Configuration in Scheduling Problems
The Job Shop Scheduling Problem (JSP) is central to operations research, primarily optimizing energy efficiency due to its profound environmental and economic implications. Efficient scheduling enhances production metrics and mitigates energy consumption, thus effectively balancing productivity and sustainability objectives. Given the intricate and diverse nature of JSP instances, along with the array of algorithms developed to tackle these challenges, an intelligent algorithm selection tool becomes paramount. This paper introduces a framework designed to identify key problem features that characterize its complexity and guide the selection of suitable algorithms. Leveraging machine learning techniques, particularly XGBoost, the framework recommends optimal solvers such as GUROBI, CPLEX, and GECODE for efficient JSP scheduling. GUROBI excels with smaller instances, while GECODE demonstrates robust scalability for complex scenarios. The proposed algorithm selector achieves an accuracy of 84.51\% in recommending the best algorithm for solving new JSP instances, highlighting its efficacy in algorithm selection. By refining feature extraction methodologies, the framework aims to broaden its applicability across diverse JSP scenarios, thereby advancing efficiency and sustainability in manufacturing logistics.
♻ ☆ Autonomous Vehicle Path Planning by Searching With Differentiable Simulation
Planning allows an agent to safely refine its actions before executing them in the real world. In autonomous driving, this is crucial to avoid collisions and navigate in complex, dense traffic scenarios. One way to plan is to search for the best action sequence. However, this is challenging when all necessary components - policy, next-state predictor, and critic - have to be learned. Here we propose Differentiable Simulation for Search (DSS), a framework that leverages the differentiable simulator Waymax as both a next state predictor and a critic. It relies on the simulator's hardcoded dynamics, making state predictions highly accurate, while utilizing the simulator's differentiability to effectively search across action sequences. Our DSS agent optimizes its actions using gradient descent over imagined future trajectories. We show experimentally that DSS - the combination of planning gradients and stochastic search - significantly improves tracking and path planning accuracy compared to sequence prediction, imitation learning, model-free RL, and other planning methods.
♻ ☆ GRAPHIC--Guidelines for Reviewing Algorithmic Practices in Human-centred Design and Interaction for Creativity
Artificial Intelligence (AI) has been increasingly applied to creative domains, leading to the development of systems that collaborate with humans in design processes. In Graphic Design, integrating computational systems into co-creative workflows presents specific challenges, as it requires balancing scientific rigour with the subjective and visual nature of design practice. Following the PRISMA methodology, we identified 872 articles, resulting in a final corpus of 71 publications describing 68 unique systems. Based on this review, we introduce GRAPHIC (Guidelines for Reviewing Algorithmic Practices in Human-centred Design and Interaction for Creativity), a framework for analysing computational systems applied to Graphic Design. Its goal is to understand how current systems support human-AI collaboration in the Graphic Design discipline. The framework comprises main dimensions, which our analysis revealed to be essential across diverse system types: (1) Collaborative Panorama, (2) Processes and Modalities, and (3) Graphic Design Principles. Its application revealed research gaps, including the need to balance initiative and control between agents, improve communication through explainable interaction models, and promote systems that support transformational creativity grounded in core design principles.
comment: 20 pages, 16 figures
♻ ☆ Future-Back Threat Modeling: A Foresight-Driven Security Framework
Traditional threat modeling remains reactive-focused on known TTPs and past incident data, while threat prediction and forecasting frameworks are often disconnected from operational or architectural artifacts. This creates a fundamental weakness: the most serious cyber threats often do not arise from what is known, but from what is assumed, overlooked, or not yet conceived, and frequently originate from the future, such as artificial intelligence, information warfare, and supply chain attacks, where adversaries continuously develop new exploits that can bypass defenses built on current knowledge. To address this mental gap, this paper introduces the theory and methodology of Future-Back Threat Modeling (FBTM). This predictive approach begins with envisioned future threat states and works backward to identify assumptions, gaps, blind spots, and vulnerabilities in the current defense architecture, providing a clearer and more accurate view of impending threats so that we can anticipate their emergence and shape the future we want through actions taken now. The proposed methodology further aims to reveal known unknowns and unknown unknowns, including tactics, techniques, and procedures that are emerging, anticipated, and plausible. This enhances the predictability of adversary behavior, particularly under future uncertainty, helping security leaders make informed decisions today that shape more resilient security postures for the future.
♻ ☆ Beyond SELECT: A Comprehensive Taxonomy-Guided Benchmark for Real-World Text-to-SQL Translation
Text-to-SQL datasets are essential for training and evaluating text-to-SQL models, but existing datasets often suffer from limited coverage and fail to capture the diversity of real-world applications. To address this, we propose a novel taxonomy for text-to-SQL classification based on dimensions including core intents, statement types, syntax structures, and key actions. Using this taxonomy, we evaluate widely used public text-to-SQL datasets (e.g., Spider and Bird) and reveal limitations in their coverage and diversity. We then introduce a taxonomy-guided dataset synthesis pipeline, yielding a new dataset named SQL-Synth. This approach combines the taxonomy with Large Language Models (LLMs) to ensure the dataset reflects the breadth and complexity of real-world text-to-SQL applications. Extensive analysis and experimental results validate the effectiveness of our taxonomy, as SQL-Synth exhibits greater diversity and coverage compared to existing benchmarks. Moreover, we uncover that existing LLMs typically fall short in adequately capturing the full range of scenarios, resulting in limited performance on SQL-Synth. However, fine-tuning can substantially improve their performance in these scenarios. The proposed taxonomy has significant potential impact, as it not only enables comprehensive analysis of datasets and the performance of different LLMs, but also guides the construction of training data for LLMs.
♻ ☆ On the dimension of pullback attractors in recurrent neural networks
Recurrent Neural Networks (RNNs) are high-dimensional state space models capable of learning functions on sequence data. Recently, it has been conjectured that reservoir computers, a particular class of RNNs, trained on observations of a dynamical systems can be interpreted as embeddings. This result has been established for the case of linear reservoir systems. In this work, we use a nonautonomous dynamical systems approach to establish an upper bound for the fractal dimension of the subset of reservoir state space approximated during training and prediction phase. We prove that when the input sequences comes from an Nin-dimensional invertible dynamical system, the fractal dimension of this set is bounded above by Nin. The result obtained here are useful in dimensionality reduction of computation in RNNs as well as estimating fractal dimensions of dynamical systems from limited observations of their time series. It is also a step towards understanding embedding properties of reservoir computers.
comment: Issues with clarity and notation
♻ ☆ DataSage: Multi-agent Collaboration for Insight Discovery with External Knowledge Retrieval, Multi-role Debating, and Multi-path Reasoning
In today's data-driven era, fully automated end-to-end data analytics, particularly insight discovery, is critical for discovering actionable insights that assist organizations in making effective decisions. With the rapid advancement of large language models (LLMs), LLM-driven agents have emerged as a promising paradigm for automating data analysis and insight discovery. However, existing data insight agents remain limited in several key aspects, often failing to deliver satisfactory results due to: (1) insufficient utilization of domain knowledge, (2) shallow analytical depth, and (3) error-prone code generation during insight generation. To address these issues, we propose DataSage, a novel multi-agent framework that incorporates three innovative features including external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen analytical depth, and multi-path reasoning to improve the accuracy of the generated code and insights. Extensive experiments on InsightBench demonstrate that DataSage consistently outperforms existing data insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.
♻ ☆ Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving AAAI 2026
Large Language Models (LLMs) have demonstrated significant potential across various domains. However, they often struggle with integrating external knowledge and performing complex reasoning, leading to hallucinations and unreliable outputs. Retrieval Augmented Generation (RAG) has emerged as a promising paradigm to mitigate these issues by incorporating external knowledge. Yet, conventional RAG approaches, especially those based on vector similarity, fail to effectively capture relational dependencies and support multi-step reasoning. In this work, we propose CogGRAG, a human cognition-inspired, graph-based RAG framework designed for Knowledge Graph Question Answering (KGQA). CogGRAG models the reasoning process as a tree-structured mind map that decomposes the original problem into interrelated subproblems and explicitly encodes their semantic relationships. This structure not only provides a global view to guide subsequent retrieval and reasoning but also enables self-consistent verification across reasoning paths. The framework operates in three stages: (1) top-down problem decomposition via mind map construction, (2) structured retrieval of both local and global knowledge from external Knowledge Graphs (KGs), and (3) bottom-up reasoning with dual-process self-verification. Unlike previous tree-based decomposition methods such as MindMap or Graph-CoT, CogGRAG unifies problem decomposition, knowledge retrieval, and reasoning under a single graph-structured cognitive framework, allowing early integration of relational knowledge and adaptive verification. Extensive experiments demonstrate that CogGRAG achieves superior accuracy and reliability compared to existing methods.
comment: The paper has been accepted by AAAI 2026
♻ ☆ COLI: A Hierarchical Efficient Compressor for Large Images
The escalating adoption of high-resolution, large-field-of-view imagery amplifies the need for efficient compression methodologies. Conventional techniques frequently fail to preserve critical image details, while data-driven approaches exhibit limited generalizability. Implicit Neural Representations (INRs) present a promising alternative by learning continuous mappings from spatial coordinates to pixel intensities for individual images, thereby storing network weights rather than raw pixels and avoiding the generalization problem. However, INR-based compression of large images faces challenges including slow compression speed and suboptimal compression ratios. To address these limitations, we introduce COLI (Compressor for Large Images), a novel framework leveraging Neural Representations for Videos (NeRV). First, recognizing that INR-based compression constitutes a training process, we accelerate its convergence through a pretraining-finetuning paradigm, mixed-precision training, and reformulation of the sequential loss into a parallelizable objective. Second, capitalizing on INRs' transformation of image storage constraints into weight storage, we implement Hyper-Compression, a novel post-training technique to substantially enhance compression ratios while maintaining minimal output distortion. Evaluations across two medical imaging datasets demonstrate that COLI consistently achieves competitive or superior PSNR and SSIM metrics at significantly reduced bits per pixel (bpp), while accelerating NeRV training by up to 4 times.
♻ ☆ Gradient Propagation in Retrosynthetic Space: An Efficient Framework for Synthesis Plan Generation
Retrosynthesis, which aims to identify viable synthetic pathways for target molecules by decomposing them into simpler precursors, is often treated as a search problem. However, its complexity arises from multi-branched tree-structured pathways rather than linear paths. Some algorithms have been successfully applied in this task, but they either overlook the uncertainties inherent in chemical space or face limitations in practical application scenarios. To address these challenges, this paper introduces a novel gradient-propagation-based algorithmic framework for retrosynthetic route exploration. The proposed framework obtains the contributions of different nodes to the target molecule's success probability through gradient propagation and then guides the algorithm to greedily select the node with the highest contribution for expansion, thereby conducting efficient search in the chemical space. Experimental validations demonstrate that our algorithm achieves broad applicability across diverse molecular targets and exhibits superior computational efficiency compared to existing methods.
♻ ☆ Pathway to Relevance: How Cross-Encoders Implement a Semantic Variant of BM25
Mechanistic interpretation has greatly contributed to a more detailed understanding of generative language models, enabling significant progress in identifying structures that implement key behaviors through interactions between internal components. In contrast, interpretability in information retrieval (IR) remains relatively coarse-grained, and much is still unknown as to how IR models determine whether a document is relevant to a query. In this work, we address this gap by mechanistically analyzing how one commonly used model, a cross-encoder, estimates relevance. We find that the model extracts traditional relevance signals, such as term frequency and inverse document frequency, in early-to-middle layers. These concepts are then combined in later layers, similar to the well-known probabilistic ranking function, BM25. Overall, our analysis offers a more nuanced understanding of how IR models compute relevance. Isolating these components lays the groundwork for future interventions that could enhance transparency, mitigate safety risks, and improve scalability.
♻ ☆ Preprint: Exploring Inevitable Waypoints for Unsolvability Explanation in Hybrid Planning Problems
Explaining unsolvability of planning problems is of significant research interest in Explainable AI Planning. AI planning literature has reported several research efforts on generating explanations of solutions to planning problems. However, explaining the unsolvability of planning problems remains a largely open and understudied problem. A widely practiced approach to plan generation and automated problem solving, in general, is to decompose tasks into sub-problems that help progressively converge towards the goal. In this paper, we propose to adopt the same philosophy of sub-problem identification as a mechanism for analyzing and explaining unsolvability of planning problems in hybrid systems. In particular, for a given unsolvable planning problem, we propose to identify common waypoints, which are universal obstacles to plan existence; in other words, they appear on every plan from the source to the planning goal. This work envisions such waypoints as sub-problems of the planning problem and the unreachability of any of these waypoints as an explanation for the unsolvability of the original planning problem. We propose a novel method of waypoint identification by casting the problem as an instance of the longest common subsequence problem, a widely popular problem in computer science, typically considered as an illustrative example for the dynamic programming paradigm. Once the waypoints are identified, we perform symbolic reachability analysis on them to identify the earliest unreachable waypoint and report it as the explanation of unsolvability. We present experimental results on unsolvable planning problems in hybrid domains.
♻ ☆ PairHuman: A High-Fidelity Photographic Dataset for Customized Dual-Person Generation
Personalized dual-person portrait customization has considerable potential applications, such as preserving emotional memories and facilitating wedding photography planning. However, the absence of a benchmark dataset hinders the pursuit of high-quality customization in dual-person portrait generation. In this paper, we propose the PairHuman dataset, which is the first large-scale benchmark dataset specifically designed for generating dual-person portraits that meet high photographic standards. The PairHuman dataset contains more than 100K images that capture a variety of scenes, attire, and dual-person interactions, along with rich metadata, including detailed image descriptions, person localization, human keypoints, and attribute tags. We also introduce DHumanDiff, which is a baseline specifically crafted for dual-person portrait generation that features enhanced facial consistency and simultaneously balances in personalized person generation and semantic-driven scene creation. Finally, the experimental results demonstrate that our dataset and method produce highly customized portraits with superior visual quality that are tailored to human preferences. Our dataset is publicly available at https://github.com/annaoooo/PairHuman.
comment: 46 pages, 31 figures
♻ ☆ AlphaBeta is not as good as you think: a simple random games model for a better analysis of deterministic game-solving algorithms
Deterministic game-solving algorithms are conventionally analyzed in the light of their average-case complexity against a distribution of random game-trees, where leaf values are independently sampled from a fixed distribution. This simplified model enables uncluttered mathematical analysis, revealing two key properties: root value distributions asymptotically collapse to a single fixed value for finite-valued trees, and all reasonable algorithms achieve global optimality. However, these findings are artifacts of the model's design: its long criticized independence assumption strips games of structural complexity, producing trivial instances where no algorithm faces meaningful challenges. To address this limitation, we introduce a simple probabilistic model that incrementally constructs game-trees using a fixed level-wise conditional distribution. By enforcing ancestor dependencies, a critical structural feature of real-world games, our framework generates problems with adjustable difficulty while retaining some form of analytical tractability. For several algorithms, including AlphaBeta and Scout, we derive recursive formulas characterizing their average-case complexities under this model. These allow us to rigorously compare algorithms on deep game-trees, where Monte-Carlo simulations are no longer feasible. While asymptotically, all algorithms seem to converge to identical branching factor (a result analogous to that of independence-based models), deep finite trees reveal stark differences: AlphaBeta incurs a significantly larger constant multiplicative factor compared to algorithms like Scout, leading to a substantial practical slowdown. Our framework sheds new light on classical game-solving algorithms, offering rigorous evidence and analytical tools to advance the understanding of these methods under a richer, more challenging, and yet tractable model.
♻ ☆ Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Video Models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning via video paradigm and introduce VR-Bench -- a comprehensive benchmark designed to systematically evaluate video models' reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video model. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10--20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.
♻ ☆ Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL
Generating accurate SQL from users' natural language questions (text-to-SQL) remains a long-standing challenge due to the complexities involved in user question understanding, database schema comprehension, and SQL generation. Traditional text-to-SQL systems, which combine human engineering and deep neural networks, have made significant progress. Subsequently, pre-trained language models (PLMs) have been developed for text-to-SQL tasks, achieving promising results. However, as modern databases and user questions grow more complex, PLMs with a limited parameter size often produce incorrect SQL. This necessitates more sophisticated and tailored optimization methods, which restricts the application of PLM-based systems. Recently, large language models (LLMs) have shown significant capabilities in natural language understanding as model scale increases. Thus, integrating LLM-based solutions can bring unique opportunities, improvements, and solutions to text-to-SQL research. In this survey, we provide a comprehensive review of existing LLM-based text-to-SQL studies. Specifically, we offer a brief overview of the technical challenges and evolutionary process of text-to-SQL. Next, we introduce the datasets and metrics designed to evaluate text-to-SQL systems. Subsequently, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we make a summarization and discuss the remaining challenges in this field and suggest expectations for future research directions. All the related resources of LLM-based, including research papers, benchmarks, and open-source projects, are collected for the community in our repository: https://github.com/DEEP-PolyU/Awesome-LLM-based-Text2SQL.
comment: Accepted to IEEE TKDE2025
Machine Learning
☆ VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection
We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
☆ Breaking the Likelihood-Quality Trade-off in Diffusion Models by Merging Pretrained Experts ICLR 2025
Diffusion models for image generation often exhibit a trade-off between perceptual sample quality and data likelihood: training objectives emphasizing high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity. We introduce a simple plug-and-play sampling method that combines two pretrained diffusion experts by switching between them along the denoising trajectory. Specifically, we apply an image-quality expert at high noise levels to shape global structure, then switch to a likelihood expert at low noise levels to refine pixel statistics. The approach requires no retraining or fine-tuning -- only the choice of an intermediate switching step. On CIFAR-10 and ImageNet32, the merged model consistently matches or outperforms its base components, improving or preserving both likelihood and sample quality relative to each expert alone. These results demonstrate that expert switching across noise levels is an effective way to break the likelihood-quality trade-off in image diffusion models.
comment: ICLR 2025 DeLTa workshop
☆ Flow Map Distillation Without Data
State-of-the-art flow models achieve remarkable quality but require slow, iterative sampling. To accelerate this, flow maps can be distilled from pre-trained teachers, a procedure that conventionally requires sampling from an external dataset. We argue that this data-dependency introduces a fundamental risk of Teacher-Data Mismatch, as a static dataset may provide an incomplete or even misaligned representation of the teacher's full generative capabilities. This leads us to question whether this reliance on data is truly necessary for successful flow map distillation. In this work, we explore a data-free alternative that samples only from the prior distribution, a distribution the teacher is guaranteed to follow by construction, thereby circumventing the mismatch risk entirely. To demonstrate the practical viability of this philosophy, we introduce a principled framework that learns to predict the teacher's sampling path while actively correcting for its own compounding errors to ensure high fidelity. Our approach surpasses all data-based counterparts and establishes a new state-of-the-art by a significant margin. Specifically, distilling from SiT-XL/2+REPA, our method reaches an impressive FID of 1.45 on ImageNet 256x256, and 1.49 on ImageNet 512x512, both with only 1 sampling step. We hope our work establishes a more robust paradigm for accelerating generative models and motivates the broader adoption of flow map distillation without data.
☆ Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens-compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with COVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16% and demonstrates that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.
comment: Project page: https://wakalsprojectpage.github.io/comt-website/
☆ Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration
Large Language Models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge-intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision), often requires costly development of large-scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular, multi-agent framework for extending LLMs to multimodal reasoning by orchestrating collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners through conversations. We then introduce a data synthesis and supervised fine-tuning pipeline to train the perceiver agent to effectively collaborate with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks the multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open-source solution, i.e. equipping text-only DeepSeek-R1 with Qwen2.5-VL-7B perceiver, to outperform large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi-agent approach for building future multimodal reasoning systems.
☆ UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame
☆ Learning Robust Social Strategies with Large Language Models
As agentic AI becomes more widespread, agents with distinct and possibly conflicting goals will interact in complex ways. These multi-agent interactions pose a fundamental challenge, particularly in social dilemmas, where agents' individual incentives can undermine collective welfare. While reinforcement learning (RL) has been effective for aligning large language models (LLMs) in the single-agent regime, prior small-network results suggest that standard RL in multi-agent settings often converges to defecting, self-interested policies. We show the same effect in LLMs: despite cooperative priors, RL-trained LLM agents develop opportunistic behavior that can exploit even advanced closed-source models. To address this tendency of RL to converge to poor equilibria, we adapt a recent opponent-learning awareness algorithm, Advantage Alignment, to fine-tune LLMs toward multi-agent cooperation and non-exploitability. We then introduce a group-relative baseline that simplifies advantage computation in iterated games, enabling multi-agent training at LLM scale. We also contribute a novel social dilemma environment, Trust and Split, which requires natural language communication to achieve high collective welfare. Across a wide range of social dilemmas, policies learned with Advantage Alignment achieve higher collective payoffs while remaining robust against exploitation by greedy agents.
☆ Nonparametric Instrumental Variable Regression with Observed Covariates
We study the problem of nonparametric instrumental variable regression with observed covariates, which we refer to as NPIV-O. Compared with standard nonparametric instrumental variable regression (NPIV), the additional observed covariates facilitate causal identification and enables heterogeneous causal effect estimation. However, the presence of observed covariates introduces two challenges for its theoretical analysis. First, it induces a partial identity structure, which renders previous NPIV analyses - based on measures of ill-posedness, stability conditions, or link conditions - inapplicable. Second, it imposes anisotropic smoothness on the structural function. To address the first challenge, we introduce a novel Fourier measure of partial smoothing; for the second challenge, we extend the existing kernel 2SLS instrumental variable algorithm with observed covariates, termed KIV-O, to incorporate Gaussian kernel lengthscales adaptive to the anisotropic smoothness. We prove upper $L^2$-learning rates for KIV-O and the first $L^2$-minimax lower learning rates for NPIV-O. Both rates interpolate between known optimal rates of NPIV and nonparametric regression (NPR). Interestingly, we identify a gap between our upper and lower bounds, which arises from the choice of kernel lengthscales tuned to minimize a projected risk. Our theoretical analysis also applies to proximal causal inference, an emerging framework for causal effect estimation that shares the same conditional moment restriction as NPIV-O.
☆ DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.
☆ PTF Testing Lower Bounds for Non-Gaussian Component Analysis
This work studies information-computation gaps for statistical problems. A common approach for providing evidence of such gaps is to show sample complexity lower bounds (that are stronger than the information-theoretic optimum) against natural models of computation. A popular such model in the literature is the family of low-degree polynomial tests. While these tests are defined in such a way that make them easy to analyze, the class of algorithms that they rule out is somewhat restricted. An important goal in this context has been to obtain lower bounds against the stronger and more natural class of low-degree Polynomial Threshold Function (PTF) tests, i.e., any test that can be expressed as comparing some low-degree polynomial of the data to a threshold. Proving lower bounds against PTF tests has turned out to be challenging. Indeed, we are not aware of any non-trivial PTF testing lower bounds in the literature. In this paper, we establish the first non-trivial PTF testing lower bounds for a range of statistical tasks. Specifically, we prove a near-optimal PTF testing lower bound for Non-Gaussian Component Analysis (NGCA). Our NGCA lower bound implies similar lower bounds for a number of other statistical problems. Our proof leverages a connection to recent work on pseudorandom generators for PTFs and recent techniques developed in that context. At the technical level, we develop several tools of independent interest, including novel structural results for analyzing the behavior of low-degree polynomials restricted to random directions.
☆ Predicting partially observable dynamical systems via diffusion models with a multiscale inference scheme
Conditional diffusion models provide a natural framework for probabilistic prediction of dynamical systems and have been successfully applied to fluid dynamics and weather prediction. However, in many settings, the available information at a given time represents only a small fraction of what is needed to predict future states, either due to measurement uncertainty or because only a small fraction of the state can be observed. This is true for example in solar physics, where we can observe the Sun's surface and atmosphere, but its evolution is driven by internal processes for which we lack direct measurements. In this paper, we tackle the probabilistic prediction of partially observable, long-memory dynamical systems, with applications to solar dynamics and the evolution of active regions. We show that standard inference schemes, such as autoregressive rollouts, fail to capture long-range dependencies in the data, largely because they do not integrate past information effectively. To overcome this, we propose a multiscale inference scheme for diffusion models, tailored to physical processes. Our method generates trajectories that are temporally fine-grained near the present and coarser as we move farther away, which enables capturing long-range temporal dependencies without increasing computational cost. When integrated into a diffusion model, we show that our inference scheme significantly reduces the bias of the predicted distributions and improves rollout stability.
☆ Efficiency vs. Fidelity: A Comparative Analysis of Diffusion Probabilistic Models and Flow Matching on Low-Resource Hardware
Denoising Diffusion Probabilistic Models (DDPMs) have established a new state-of-the-art in generative image synthesis, yet their deployment is hindered by significant computational overhead during inference, often requiring up to 1,000 iterative steps. This study presents a rigorous comparative analysis of DDPMs against the emerging Flow Matching (Rectified Flow) paradigm, specifically isolating their geometric and efficiency properties on low-resource hardware. By implementing both frameworks on a shared Time-Conditioned U-Net backbone using the MNIST dataset, we demonstrate that Flow Matching significantly outperforms Diffusion in efficiency. Our geometric analysis reveals that Flow Matching learns a highly rectified transport path (Curvature $\mathcal{C} \approx 1.02$), which is near-optimal, whereas Diffusion trajectories remain stochastic and tortuous ($\mathcal{C} \approx 3.45$). Furthermore, we establish an ``efficiency frontier'' at $N=10$ function evaluations, where Flow Matching retains high fidelity while Diffusion collapses. Finally, we show via numerical sensitivity analysis that the learned vector field is sufficiently linear to render high-order ODE solvers (Runge-Kutta 4) unnecessary, validating the use of lightweight Euler solvers for edge deployment. \textbf{This work concludes that Flow Matching is the superior algorithmic choice for real-time, resource-constrained generative tasks.}
☆ LLM-Driven Stationarity-Aware Expert Demonstrations for Multi-Agent Reinforcement Learning in Mobile Systems
Multi-agent reinforcement learning (MARL) has been increasingly adopted in many real-world applications. While MARL enables decentralized deployment on resource-constrained edge devices, it suffers from severe non-stationarity due to the synchronous updates of agent policies. This non stationarity results in unstable training and poor policy con vergence, especially as the number of agents increases. In this paper, we propose RELED, a scalable MARL framework that integrates large language model (LLM)-driven expert demonstrations with autonomous agent exploration. RELED incorporates a Stationarity-Aware Expert Demonstration module, which leverages theoretical non-stationarity bounds to enhance the quality of LLM-generated expert trajectories, thus providing high reward and training-stable samples for each agent. Moreover, a Hybrid Expert-Agent Policy Optimization module adaptively balances each agent's learning from both expert-generated and agent-generated trajectories, accelerating policy convergence and improving generalization. Extensive experiments with real city networks based on OpenStreetMap demonstrate that RELED achieves superior performance compared to state-of-the-art MARL methods.
comment: 15 pages, 9 figures
☆ Neural surrogates for designing gravitational wave detectors
Physics simulators are essential in science and engineering, enabling the analysis, control, and design of complex systems. In experimental sciences, they are increasingly used to automate experimental design, often via combinatorial search and optimization. However, as the setups grow more complex, the computational cost of traditional, CPU-based simulators becomes a major limitation. Here, we show how neural surrogate models can significantly reduce reliance on such slow simulators while preserving accuracy. Taking the design of interferometric gravitational wave detectors as a representative example, we train a neural network to surrogate the gravitational wave physics simulator Finesse, which was developed by the LIGO community. Despite that small changes in physical parameters can change the output by orders of magnitudes, the model rapidly predicts the quality and feasibility of candidate designs, allowing an efficient exploration of large design spaces. Our algorithm loops between training the surrogate, inverse designing new experiments, and verifying their properties with the slow simulator for further training. Assisted by auto-differentiation and GPU parallelism, our method proposes high-quality experiments much faster than direct optimization. Solutions that our algorithm finds within hours outperform designs that take five days for the optimizer to reach. Though shown in the context of gravitational wave detectors, our framework is broadly applicable to other domains where simulator bottlenecks hinder optimization and discovery.
comment: 20 pages, 7 figures, 4 tables
☆ Enhancing Conformal Prediction via Class Similarity
Conformal Prediction (CP) has emerged as a powerful statistical framework for high-stakes classification applications. Instead of predicting a single class, CP generates a prediction set, guaranteed to include the true label with a pre-specified probability. The performance of different CP methods is typically assessed by their average prediction set size. In setups where the classes can be partitioned into semantic groups, e.g., diseases that require similar treatment, users can benefit from prediction sets that are not only small on average, but also contain a small number of semantically different groups. This paper begins by addressing this problem and ultimately offers a widely applicable tool for boosting any CP method on any dataset. First, given a class partition, we propose augmenting the CP score function with a term that penalizes predictions with out-of-group errors. We theoretically analyze this strategy and prove its advantages for group-related metrics. Surprisingly, we show mathematically that, for common class partitions, it can also reduce the average set size of any CP score function. Our analysis reveals the class similarity factors behind this improvement and motivates us to propose a model-specific variant, which does not require any human semantic partition and can further reduce the prediction set size. Finally, we present an extensive empirical study, encompassing prominent CP methods, multiple models, and several datasets, which demonstrates that our class-similarity-based approach consistently enhances CP methods.
☆ Leveraging LLMs for reward function design in reinforcement learning control tasks
The challenge of designing effective reward functions in reinforcement learning (RL) represents a significant bottleneck, often requiring extensive human expertise and being time-consuming. Previous work and recent advancements in large language models (LLMs) have demonstrated their potential for automating the generation of reward functions. However, existing methodologies often require preliminary evaluation metrics, human-engineered feedback for the refinement process, or the use of environmental source code as context. To address these limitations, this paper introduces LEARN-Opt (LLM-based Evaluator and Analyzer for Reward functioN Optimization). This LLM-based, fully autonomous, and model-agnostic framework eliminates the need for preliminary metrics and environmental source code as context to generate, execute, and evaluate reward function candidates from textual descriptions of systems and task objectives. LEARN-Opt's main contribution lies in its ability to autonomously derive performance metrics directly from the system description and the task objective, enabling unsupervised evaluation and selection of reward functions. Our experiments indicate that LEARN-Opt achieves performance comparable to or better to that of state-of-the-art methods, such as EUREKA, while requiring less prior knowledge. We find that automated reward design is a high-variance problem, where the average-case candidate fails, requiring a multi-run approach to find the best candidates. Finally, we show that LEARN-Opt can unlock the potential of low-cost LLMs to find high-performing candidates that are comparable to, or even better than, those of larger models. This demonstrated performance affirms its potential to generate high-quality reward functions without requiring any preliminary human-defined metrics, thereby reducing engineering overhead and enhancing generalizability.
☆ Scalable Parameter-Light Spectral Method for Clustering Short Text Embeddings with a Cohesion-Based Evaluation Metric
Clustering short text embeddings is a foundational task in natural language processing, yet remains challenging due to the need to specify the number of clusters in advance. We introduce a scalable spectral method that estimates the number of clusters directly from the structure of the Laplacian eigenspectrum, constructed using cosine similarities and guided by an adaptive sampling strategy. This sampling approach enables our estimator to efficiently scale to large datasets without sacrificing reliability. To support intrinsic evaluation of cluster quality without ground-truth labels, we propose the Cohesion Ratio, a simple and interpretable evaluation metric that quantifies how much intra-cluster similarity exceeds the global similarity background. It has an information-theoretic motivation inspired by mutual information, and in our experiments it correlates closely with extrinsic measures such as normalized mutual information and homogeneity. Extensive experiments on six short-text datasets and four modern embedding models show that standard algorithms like K-Means and HAC, when guided by our estimator, significantly outperform popular parameter-light methods such as HDBSCAN, OPTICS, and Leiden. These results demonstrate the practical value of our spectral estimator and Cohesion Ratio for unsupervised organization and evaluation of short text data. Implementation of our estimator of k and Cohesion Ratio, along with code for reproducing the experiments, is available at https://anonymous.4open.science/r/towards_clustering-0C2E.
☆ Artificial Intelligence Driven Workflow for Accelerating Design of Novel Photosensitizers
The discovery of high-performance photosensitizers has long been hindered by the time-consuming and resource-intensive nature of traditional trial-and-error approaches. Here, we present \textbf{A}I-\textbf{A}ccelerated \textbf{P}hoto\textbf{S}ensitizer \textbf{I}nnovation (AAPSI), a closed-loop workflow that integrates expert knowledge, scaffold-based molecule generation, and Bayesian optimization to accelerate the design of novel photosensitizers. The scaffold-driven generation in AAPSI ensures structural novelty and synthetic feasibility, while the iterative AI-experiment loop accelerates the discovery of novel photosensitizers. AAPSI leverages a curated database of 102,534 photosensitizer-solvent pairs and generate 6,148 synthetically accessible candidates. These candidates are screened via graph transformers trained to predict singlet oxygen quantum yield ($φ_Δ$) and absorption maxima ($λ_{max}$), following experimental validation. This work generates several novel candidates for photodynamic therapy (PDT), among which the hypocrellin-based candidate HB4Ph exhibits exceptional performance at the Pareto frontier of high quantum yield of singlet oxygen and long absorption maxima among current photosensitizers ($φ_Δ$=0.85, $λ_{max}$=650nm).
☆ Annotation-Free Class-Incremental Learning
Despite significant progress in continual learning ranging from architectural novelty to clever strategies for mitigating catastrophic forgetting most existing methods rest on a strong but unrealistic assumption the availability of labeled data throughout the learning process. In real-world scenarios, however, data often arrives sequentially and without annotations, rendering conventional approaches impractical. In this work, we revisit the fundamental assumptions of continual learning and ask: Can current systems adapt when labels are absent and tasks emerge incrementally over time? To this end, we introduce Annotation-Free Class-Incremental Learning (AFCIL), a more realistic and challenging paradigm where unlabeled data arrives continuously, and the learner must incrementally acquire new classes without any supervision. To enable effective learning under AFCIL, we propose CrossWorld CL, a Cross Domain World Guided Continual Learning framework that incorporates external world knowledge as a stable auxiliary source. The method retrieves semantically related ImageNet classes for each downstream category, maps downstream and ImageNet features through a cross domain alignment strategy and finally introduce a novel replay strategy. This design lets the model uncover semantic structure without annotations while keeping earlier knowledge intact. Across four datasets, CrossWorld-CL surpasses CLIP baselines and existing continual and unlabeled learning methods, underscoring the benefit of world knowledge for annotation free continual learning.
comment: 18 pages, 6 figures
☆ High-throughput validation of phase formability and simulation accuracy of Cantor alloys
High-throughput methods enable accelerated discovery of novel materials in complex systems such as high-entropy alloys, which exhibit intricate phase stability across vast compositional spaces. Computational approaches, including Density Functional Theory (DFT) and calculation of phase diagrams (CALPHAD), facilitate screening of phase formability as a function of composition and temperature. However, the integration of computational predictions with experimental validation remains challenging in high-throughput studies. In this work, we introduce a quantitative confidence metric to assess the agreement between predictions and experimental observations, providing a quantitative measure of the confidence of machine learning models trained on either DFT or CALPHAD input in accounting for experimental evidence. The experimental dataset was generated via high-throughput in-situ synchrotron X-ray diffraction on compositionally varied FeNiMnCr alloy libraries, heated from room temperature to ~1000 °C. Agreement between the observed and predicted phases was evaluated using either temperature-independent phase classification or a model that incorporates a temperature-dependent probability of phase formation. This integrated approach demonstrates where strong overall agreement between computation and experiment exists, while also identifying key discrepancies, particularly in FCC/BCC predictions at Mn-rich regions to inform future model refinement.
☆ Targeted Manipulation: Slope-Based Attacks on Financial Time-Series Data
A common method of attacking deep learning models is through adversarial attacks, which occur when an attacker specifically modifies the input of a model to produce an incorrect result. Adversarial attacks have been deeply investigated in the image domain; however, there is less research in the time-series domain and very little for forecasting financial data. To address these concerns, this study aims to build upon previous research on adversarial attacks for time-series data by introducing two new slope-based methods aimed to alter the trends of the predicted stock forecast generated by an N-HiTS model. Compared to the normal N-HiTS predictions, the two new slope-based methods, the General Slope Attack and Least-Squares Slope Attack, can manipulate N-HiTS predictions by doubling the slope. These new slope attacks can bypass standard security mechanisms, such as a discriminator that filters real and perturbed inputs, reducing a 4-layered CNN's specificity to 28% and accuracy to 57%. Furthermore, the slope based methods were incorporated into a GAN architecture as a means of generating realistic synthetic data, while simultaneously fooling the model. Finally, this paper also proposes a sample malware designed to inject an adversarial attack in the model inference library, proving that ML-security research should not only focus on making the model safe, but also securing the entire pipeline.
comment: 13 pages, 6 figures, 4 tables, preprint; Total including Appendix: 21 pages, 11 figures, 7 tables
☆ Understanding the Staged Dynamics of Transformers in Learning Latent Structure
While transformers can discover latent structure from context, the dynamics of how they acquire different components of the latent structure remain poorly understood. In this work, we use the Alchemy benchmark, to investigate the dynamics of latent structure learning. We train a small decoder-only transformer on three task variants: 1) inferring missing rules from partial contextual information, 2) composing simple rules to solve multi-step sequences, and 3) decomposing complex multi-step examples to infer intermediate steps. By factorizing each task into interpretable events, we show that the model acquires capabilities in discrete stages, first learning the coarse grained rules, before learning the complete latent structure. We also identify a crucial asymmetry, where the model can compose fundamental rules robustly, but struggles to decompose complex examples to discover the fundamental rules. These findings offer new insights into understanding how a transformer model learns latent structures, providing a granular view of how these capabilities evolve during training.
comment: Preprint
☆ PRInTS: Reward Modeling for Long-Horizon Information Seeking
Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. While process reward models (PRMs) can guide agents by ranking candidate steps at test-time, existing PRMs, designed for short reasoning with binary judgment, cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM's reasoning across multiple step quality dimensions (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. Extensive evaluations across FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks on multiple models, along with ablations, reveal that best-of-n sampling with PRInTS enhances information-seeking abilities of open-source models as well as specialized agents, matching or surpassing the performance of frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.
comment: 18 pages, code: https://github.com/G-JWLee/PRInTS
☆ AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning
Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decrease as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is avaiable at https://github.com/FoundationAgents/AutoEnv.
☆ Open-weight genome language model safeguards: Assessing robustness via adversarial fine-tuning NeurIPS 2025
Novel deep learning architectures are increasingly being applied to biological data, including genetic sequences. These models, referred to as genomic language mod- els (gLMs), have demonstrated impressive predictive and generative capabilities, raising concerns that such models may also enable misuse, for instance via the generation of genomes for human-infecting viruses. These concerns have catalyzed calls for risk mitigation measures. The de facto mitigation of choice is filtering of pretraining data (i.e., removing viral genomic sequences from training datasets) in order to limit gLM performance on virus-related tasks. However, it is not currently known how robust this approach is for securing open-source models that can be fine-tuned using sensitive pathogen data. Here, we evaluate a state-of-the-art gLM, Evo 2, and perform fine-tuning using sequences from 110 harmful human-infecting viruses to assess the rescue of misuse-relevant predictive capabilities. The fine- tuned model exhibited reduced perplexity on unseen viral sequences relative to 1) the pretrained model and 2) a version fine-tuned on bacteriophage sequences. The model fine-tuned on human-infecting viruses also identified immune escape variants from SARS-CoV-2 (achieving an AUROC of 0.6), despite having no expo- sure to SARS-CoV-2 sequences during fine-tuning. This work demonstrates that data exclusion might be circumvented by fine-tuning approaches that can, to some degree, rescue misuse-relevant capabilities of gLMs. We highlight the need for safety frameworks for gLMs and outline further work needed on evaluations and mitigation measures to enable the safe deployment of gLMs.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Biosecurity Safeguards for Generative AI
☆ TorchQuantumDistributed NeurIPS 2025
TorchQuantumDistributed (tqd) is a PyTorch-based [Paszke et al., 2019] library for accelerator-agnostic differentiable quantum state vector simulation at scale. This enables studying the behavior of learnable parameterized near-term and fault- tolerant quantum circuits with high qubit counts.
comment: 12 pages, 4 figures, to appear in the AI for Science Workshop at NeurIPS 2025
☆ Performance Guarantees for Quantum Neural Estimation of Entropies
Estimating quantum entropies and divergences is an important problem in quantum physics, information theory, and machine learning. Quantum neural estimators (QNEs), which utilize a hybrid classical-quantum architecture, have recently emerged as an appealing computational framework for estimating these measures. Such estimators combine classical neural networks with parametrized quantum circuits, and their deployment typically entails tedious tuning of hyperparameters controlling the sample size, network architecture, and circuit topology. This work initiates the study of formal guarantees for QNEs of measured (Rényi) relative entropies in the form of non-asymptotic error risk bounds. We further establish exponential tail bounds showing that the error is sub-Gaussian, and thus sharply concentrates about the ground truth value. For an appropriate sub-class of density operator pairs on a space of dimension $d$ with bounded Thompson metric, our theory establishes a copy complexity of $O(|Θ(\mathcal{U})|d/ε^2)$ for QNE with a quantum circuit parameter set $Θ(\mathcal{U})$, which has minimax optimal dependence on the accuracy $ε$. Additionally, if the density operator pairs are permutation invariant, we improve the dimension dependence above to $O(|Θ(\mathcal{U})|\mathrm{polylog}(d)/ε^2)$. Our theory aims to facilitate principled implementation of QNEs for measured relative entropies and guide hyperparameter tuning in practice.
comment: 42+4 pages
☆ The Unified Non-Convex Framework for Robust Causal Inference: Overcoming the Gaussian Barrier and Optimization Fragility
This document proposes a Unified Robust Framework that re-engineers the estimation of the Average Treatment Effect on the Overlap (ATO). It synthesizes gamma-Divergence for outlier robustness, Graduated Non-Convexity (GNC) for global optimization, and a "Gatekeeper" mechanism to address the impossibility of higher-order orthogonality in Gaussian regimes.
comment: 10 pages, 1 table
☆ MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings
A cognitive map is an internal model which encodes the abstract relationships among entities in the world, giving humans and animals the flexibility to adapt to new situations, with a strong out-of-distribution (OOD) generalization that current AI systems still do not possess. To bridge this gap, we introduce MapFormers, new architectures based on Transformer models, which can learn cognitive maps from observational data and perform path integration in parallel, in a self-supervised manner. Cognitive maps are learned in the model by disentangling structural relationships in the inputs from their specific content, a property that can be achieved naturally by updating the positional encoding in Transformers with input-dependent matrices. We developed two variants of MapFormers that unify absolute and relative positional encoding to model episodic (EM) and working memory (WM), respectively. We tested MapFormers on several tasks, including a classic 2D navigation task, showing that our models can learn a cognitive map of the underlying space and generalize OOD (e.g., to longer sequences) with near-perfect performance, unlike current architectures. Together, these results demonstrate the superiority of models designed to learn a cognitive map, and the importance of introducing a structural bias for structure-content disentanglement, which can be achieved in Transformers with input-dependent positional encoding. MapFormers have broad applications in both neuroscience and AI, by explaining the neural mechanisms giving rise to cognitive maps, while allowing these relation models to be learned at scale.
comment: 19 pages (29 with appendix), 8 figures
☆ Closing Gaps in Emissions Monitoring with Climate TRACE
Global greenhouse gas emissions estimates are essential for monitoring and mitigation planning. Yet most datasets lack one or more characteristics that enhance their actionability, such as accuracy, global coverage, high spatial and temporal resolution, and frequent updates. To address these gaps, we present Climate TRACE (climatetrace.org), an open-access platform delivering global emissions estimates with enhanced detail, coverage, and timeliness. Climate TRACE synthesizes existing emissions data, prioritizing accuracy, coverage, and resolution, and fills gaps using sector-specific estimation approaches. The dataset is the first to provide globally comprehensive emissions estimates for individual sources (e.g., individual power plants) for all anthropogenic emitting sectors. The dataset spans January 1, 2021, to the present, with a two-month reporting lag and monthly updates. The open-access platform enables non-technical audiences to engage with detailed emissions datasets for most subnational governments worldwide. Climate TRACE supports data-driven climate action at scales where decisions are made, representing a major breakthrough for emissions accounting and mitigation.
☆ Scalable Bayesian Network Structure Learning Using Tsetlin Machine to Constrain the Search Space
The PC algorithm is a widely used method in causal inference for learning the structure of Bayesian networks. Despite its popularity, the PC algorithm suffers from significant time complexity, particularly as the size of the dataset increases, which limits its applicability in large-scale real-world problems. In this study, we propose a novel approach that utilises the Tsetlin Machine (TM) to construct Bayesian structures more efficiently. Our method leverages the most significant literals extracted from the TM and performs conditional independence (CI) tests on these selected literals instead of the full set of variables, resulting in a considerable reduction in computational time. We implemented our approach and compared it with various state-of-the-art methods. Our evaluation includes categorical datasets from the bnlearn repository, such as Munin1, Hepar2. The findings indicate that the proposed TM-based method not only reduces computational complexity but also maintains competitive accuracy in causal discovery, making it a viable alternative to traditional PC algorithm implementations by offering improved efficiency without compromising performance.
☆ Tiny-TSM: Efficiently Training a Lightweight SOTA Time Series Foundation Model
We present Tiny-TSM, a time series foundation model characterized by small scale, economical training, and state-of-the-art performance. It comprises 23M total parameters, trained on a single A100 GPU in less than a week using a new synthetic data generation and data augmentation pipeline (SynthTS). Without any neural architecture search, hyperparameter tuning, or scaling up model size, Tiny-TSM achieves state-of-the-art performance on a wide range of time series benchmark datasets, often outperforming much larger models and even matching the performance of much larger, industrial-scale, likely highly tuned foundation models. Specifically, Tiny-TSM outperforms all other time series foundation models we evaluated on medium- and long-term forecasting tasks under MSE loss, while short-term accuracy is still competitive with state-of-the-art models. We also introduce a causal input normalization scheme that enables time series models to be trained with dense next-token prediction loss, significantly accelerating convergence speed and reducing training time. All experiments were conducted on a single A100 GPU, illustrating the practicality of the proposed approach in a resource-constrained setting.
☆ CDLM: Consistency Diffusion Language Models For Faster Sampling
Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
comment: 18 pages, 6 figures
☆ Leveraging Spatiotemporal Graph Neural Networks for Multi-Store Sales Forecasting
This work evaluates the effectiveness of spatiotemporal Graph Neural Networks (GNNs) for multi-store retail sales forecasting and compares their performance against ARIMA, LSTM, and XGBoost baselines. Using weekly sales data from 45 Walmart stores, we construct a relational forecasting framework that models inter-store dependencies through a learned adaptive graph. The proposed STGNN predicts log-differenced sales and reconstructs final values through a residual path, enabling stable training and improved generalisation. Experiments show that STGNN achieves the lowest overall forecasting error, outperforming all baselines in Normalised Total Absolute Error, P90 MAPE, and variance of MAPE across stores. Analysis of the learned adjacency matrix reveals meaningful functional store clusters and high-influence nodes that emerge without geographic metadata. These results demonstrate that relational structure significantly improves forecast quality in interconnected retail environments and establishes STGNNs as a robust modelling choice for multi-store demand prediction.
comment: 6 pages, 4 figures, 1 table
☆ Unboxing the Black Box: Mechanistic Interpretability for Algorithmic Understanding of Neural Networks
The black box nature of deep neural networks poses a significant challenge for the deployment of transparent and trustworthy artificial intelligence (AI) systems. With the growing presence of AI in society, it becomes increasingly important to develop methods that can explain and interpret the decisions made by these systems. To address this, mechanistic interpretability (MI) emerged as a promising and distinctive research program within the broader field of explainable artificial intelligence (XAI). MI is the process of studying the inner computations of neural networks and translating them into human-understandable algorithms. It encompasses reverse engineering techniques aimed at uncovering the computational algorithms implemented by neural networks. In this article, we propose a unified taxonomy of MI approaches and provide a detailed analysis of key techniques, illustrated with concrete examples and pseudo-code. We contextualize MI within the broader interpretability landscape, comparing its goals, methods, and insights to other strands of XAI. Additionally, we trace the development of MI as a research area, highlighting its conceptual roots and the accelerating pace of recent work. We argue that MI holds significant potential to support a more scientific understanding of machine learning systems -- treating models not only as tools for solving tasks, but also as systems to be studied and understood. We hope to invite new researchers into the field of mechanistic interpretability.
☆ Interpreting GFlowNets for Drug Discovery: Extracting Actionable Insights for Medicinal Chemistry NeurIPS 2025
Generative Flow Networks, or GFlowNets, offer a promising framework for molecular design, but their internal decision policies remain opaque. This limits adoption in drug discovery, where chemists require clear and interpretable rationales for proposed structures. We present an interpretability framework for SynFlowNet, a GFlowNet trained on documented chemical reactions and purchasable starting materials that generates both molecules and the synthetic routes that produce them. Our approach integrates three complementary components. Gradient based saliency combined with counterfactual perturbations identifies which atomic environments influence reward and how structural edits change molecular outcomes. Sparse autoencoders reveal axis aligned latent factors that correspond to physicochemical properties such as polarity, lipophilicity, and molecular size. Motif probes show that functional groups including aromatic rings and halogens are explicitly encoded and linearly decodable from the internal embeddings. Together, these results expose the chemical logic inside SynFlowNet and provide actionable and mechanistic insight that supports transparent and controllable molecular design.
comment: 13 pages, 7 figures. Accepted for presentation at NeurIPS 2025 WiML Workshop and Molecular Machine Learning Conference (MoML) 2025
☆ Solar-GECO: Perovskite Solar Cell Property Prediction with Geometric-Aware Co-Attention NeurIPS 2025
Perovskite solar cells are promising candidates for next-generation photovoltaics. However, their performance as multi-scale devices is determined by complex interactions between their constituent layers. This creates a vast combinatorial space of possible materials and device architectures, making the conventional experimental-based screening process slow and expensive. Machine learning models try to address this problem, but they only focus on individual material properties or neglect the important geometric information of the perovskite crystal. To address this problem, we propose to predict perovskite solar cell power conversion efficiency with a geometric-aware co-attention (Solar-GECO) model. Solar-GECO combines a geometric graph neural network (GNN) - that directly encodes the atomic structure of the perovskite absorber - with language model embeddings that process the textual strings representing the chemical compounds of the transport layers and other device components. Solar-GECO also integrates a co-attention module to capture intra-layer dependencies and inter-layer interactions, while a probabilistic regression head predicts both power conversion efficiency (PCE) and its associated uncertainty. Solar-GECO achieves state-of-the-art performance, significantly outperforming several baselines, reducing the mean absolute error (MAE) for PCE prediction from 3.066 to 2.936 compared to semantic GNN (the previous state-of-the-art model). Solar-GECO demonstrates that integrating geometric and textual information provides a more powerful and accurate framework for PCE prediction.
comment: Accepted at the AI for Accelerated Materials Design (AI4Mat) Workshop at NeurIPS 2025. 14 pages, 4 figures
☆ Psychometric Tests for AI Agents and Their Moduli Space
We develop a moduli-theoretic view of psychometric test batteries for AI agents and connect it explicitly to the AAI score developed previously. First, we make precise the notion of an AAI functional on a battery and set out axioms that any reasonable autonomy/general intelligence score should satisfy. Second, we show that the composite index ('AAI-Index') defined previously is a special case of our AAI functional. Third, we introduce the notion of a cognitive core of an agent relative to a battery and define the associated AAI$_{\textrm{core}}$ score as the restriction of an AAI functional to that core. Finally, we use these notions to describe invariants of batteries under evaluation-preserving symmetries and outline how moduli of equivalent batteries are organized.
☆ A Nutrition Multimodal Photoplethysmography Language Model
Hunger and satiety dynamics shape dietary behaviors and metabolic health, yet remain difficult to capture in everyday settings. We present a Nutrition Photoplethysmography Language Model (NPLM), integrating continuous photoplethysmography (PPG) from wearables with meal descriptions. NPLM projects PPG into embeddings interpretable by language models, enabling joint reasoning over physiology and meal context. Trained on 19,340 participants and 1.1 million meal-PPG pairs, the model improved daily caloric intake prediction by 11% over text-only baselines, with accuracy maintained when 80% of meal text was removed. In an independent validation study (n=140) with controlled dining and detailed meal information, the model replicated these findings. These results demonstrate the value of integrating physiological measurements from consumer wearables with meal information for noninvasive dietary monitoring at scale.
comment: 21 pages, 2 figures
☆ Medusa: Cross-Modal Transferable Adversarial Attacks on Multimodal Medical Retrieval-Augmented Generation KDD 2026
With the rapid advancement of retrieval-augmented vision-language models, multimodal medical retrieval-augmented generation (MMed-RAG) systems are increasingly adopted in clinical decision support. These systems enhance medical applications by performing cross-modal retrieval to integrate relevant visual and textual evidence for tasks, e.g., report generation and disease diagnosis. However, their complex architecture also introduces underexplored adversarial vulnerabilities, particularly via visual input perturbations. In this paper, we propose Medusa, a novel framework for crafting cross-modal transferable adversarial attacks on MMed-RAG systems under a black-box setting. Specifically, Medusa formulates the attack as a perturbation optimization problem, leveraging a multi-positive InfoNCE loss (MPIL) to align adversarial visual embeddings with medically plausible but malicious textual targets, thereby hijacking the retrieval process. To enhance transferability, we adopt a surrogate model ensemble and design a dual-loop optimization strategy augmented with invariant risk minimization (IRM). Extensive experiments on two real-world medical tasks, including medical report generation and disease diagnosis, demonstrate that Medusa achieves over 90% average attack success rate across various generation models and retrievers under appropriate parameter configuration, while remaining robust against four mainstream defenses, outperforming state-of-the-art baselines. Our results reveal critical vulnerabilities in the MMed-RAG systems and highlight the necessity of robustness benchmarking in safety-critical medical applications. The code and data are available at https://anonymous.4open.science/r/MMed-RAG-Attack-F05A.
comment: Accepted at KDD 2026 First Cycle (full version). Authors marked with * contributed equally. Yi Liu is the lead author
☆ SimDiff: Simpler Yet Better Diffusion Model for Time Series Point Forecasting AAAI 2026
Diffusion models have recently shown promise in time series forecasting, particularly for probabilistic predictions. However, they often fail to achieve state-of-the-art point estimation performance compared to regression-based methods. This limitation stems from difficulties in providing sufficient contextual bias to track distribution shifts and in balancing output diversity with the stability and precision required for point forecasts. Existing diffusion-based approaches mainly focus on full-distribution modeling under probabilistic frameworks, often with likelihood maximization objectives, while paying little attention to dedicated strategies for high-accuracy point estimation. Moreover, other existing point prediction diffusion methods frequently rely on pre-trained or jointly trained mature models for contextual bias, sacrificing the generative flexibility of diffusion models. To address these challenges, we propose SimDiff, a single-stage, end-to-end framework. SimDiff employs a single unified Transformer network carefully tailored to serve as both denoiser and predictor, eliminating the need for external pre-trained or jointly trained regressors. It achieves state-of-the-art point estimation performance by leveraging intrinsic output diversity and improving mean squared error accuracy through multiple inference ensembling. Key innovations, including normalization independence and the median-of-means estimator, further enhance adaptability and stability. Extensive experiments demonstrate that SimDiff significantly outperforms existing methods in time series point forecasting.
comment: Accepted by AAAI 2026
☆ MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization
Cooperative Multi-Agent Reinforcement Learning (MARL) faces two major design bottlenecks: crafting dense reward functions and constructing curricula that avoid local optima in high-dimensional, non-stationary environments. Existing approaches rely on fixed heuristics or use Large Language Models (LLMs) directly in the control loop, which is costly and unsuitable for real-time systems. We propose MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization), a framework that moves the LLM outside the execution loop and uses it as an offline training architect. MAESTRO introduces two generative components: (i) a semantic curriculum generator that creates diverse, performance-driven traffic scenarios, and (ii) an automated reward synthesizer that produces executable Python reward functions adapted to evolving curriculum difficulty. These components guide a standard MARL backbone (MADDPG) without increasing inference cost at deployment. We evaluate MAESTRO on large-scale traffic signal control (Hangzhou, 16 intersections) and conduct controlled ablations. Results show that combining LLM-generated curricula with LLM-generated reward shaping yields improved performance and stability. Across four seeds, the full system achieves +4.0% higher mean return (163.26 vs. 156.93) and 2.2% better risk-adjusted performance (Sharpe 1.53 vs. 0.70) over a strong curriculum baseline. These findings highlight LLMs as effective high-level designers for cooperative MARL training.
comment: Preprint. 16 pages, 6 figures. Preliminary version; extended experiments and analysis forthcoming
☆ Neural Architecture Search for Quantum Autoencoders
In recent years, machine learning and deep learning have driven advances in domains such as image classification, speech recognition, and anomaly detection by leveraging multi-layer neural networks to model complex data. Simultaneously, quantum computing (QC) promises to address classically intractable problems via quantum parallelism, motivating research in quantum machine learning (QML). Among QML techniques, quantum autoencoders show promise for compressing high-dimensional quantum and classical data. However, designing effective quantum circuit architectures for quantum autoencoders remains challenging due to the complexity of selecting gates, arranging circuit layers, and tuning parameters. This paper proposes a neural architecture search (NAS) framework that automates the design of quantum autoencoders using a genetic algorithm (GA). By systematically evolving variational quantum circuit (VQC) configurations, our method seeks to identify high-performing hybrid quantum-classical autoencoders for data reconstruction without becoming trapped in local minima. We demonstrate effectiveness on image datasets, highlighting the potential of quantum autoencoders for efficient feature extraction within a noise-prone, near-term quantum era. Our approach lays a foundation for broader application of genetic algorithms to quantum architecture search, aiming for a robust, automated method that can adapt to varied data and hardware constraints.
☆ Local Entropy Search over Descent Sequences for Bayesian Optimization
Searching large and complex design spaces for a global optimum can be infeasible and unnecessary. A practical alternative is to iteratively refine the neighborhood of an initial design using local optimization methods such as gradient descent. We propose local entropy search (LES), a Bayesian optimization paradigm that explicitly targets the solutions reachable by the descent sequences of iterative optimizers. The algorithm propagates the posterior belief over the objective through the optimizer, resulting in a probability distribution over descent sequences. It then selects the next evaluation by maximizing mutual information with that distribution, using a combination of analytic entropy calculations and Monte-Carlo sampling of descent sequences. Empirical results on high-complexity synthetic objectives and benchmark problems show that LES achieves strong sample efficiency compared to existing local and global Bayesian optimization methods.
☆ Empirical Comparison of Forgetting Mechanisms for UCB-based Algorithms on a Data-Driven Simulation Platform
Many real-world bandit problems involve non-stationary reward distributions, where the optimal decision may shift due to evolving environments. However, the performance of some typical Multi-Armed Bandit (MAB) models such as Upper Confidence Bound (UCB) algorithms degrades significantly in non-stationary environments where reward distributions change over time. To address this limitation, this paper introduces and evaluates FDSW-UCB, a novel dual-view algorithm that integrates a discount-based long-term perspective with a sliding-window-based short-term view. A data-driven semi-synthetic simulation platform, built upon the MovieLens-1M and Open Bandit datasets, is developed to test algorithm adaptability under abrupt and gradual drift scenarios. Experimental results demonstrate that a well-configured sliding-window mechanism (SW-UCB) is robust, while the widely used discounting method (D-UCB) suffers from a fundamental learning failure, leading to linear regret. Crucially, the proposed FDSW-UCB, when employing an optimistic aggregation strategy, achieves superior performance in dynamic settings, highlighting that the ensemble strategy itself is a decisive factor for success.
☆ CLASH: A Benchmark for Cross-Modal Contradiction Detection
Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection - a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection, featuring COCO images paired with contradictory captions containing controlled object-level or attribute-level contradictions. The samples include targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.
comment: First two authors contributed equally
☆ SpectraNet: FFT-assisted Deep Learning Classifier for Deepfake Face Detection
Detecting deepfake images is crucial in combating misinformation. We present a lightweight, generalizable binary classification model based on EfficientNet-B6, fine-tuned with transformation techniques to address severe class imbalances. By leveraging robust preprocessing, oversampling, and optimization strategies, our model achieves high accuracy, stability, and generalization. While incorporating Fourier transform-based phase and amplitude features showed minimal impact, our proposed framework helps non-experts to effectively identify deepfake images, making significant strides toward accessible and reliable deepfake detection.
comment: 4 pages, 3 figures
☆ From Raw Features to Effective Embeddings: A Three-Stage Approach for Multimodal Recipe Recommendation
Recipe recommendation has become an essential task in web-based food platforms. A central challenge is effectively leveraging rich multimodal features beyond user-recipe interactions. Our analysis shows that even simple uses of multimodal signals yield competitive performance, suggesting that systematic enhancement of these signals is highly promising. We propose TESMR, a 3-stage framework for recipe recommendation that progressively refines raw multimodal features into effective embeddings through: (1) content-based enhancement using foundation models with multimodal comprehension, (2) relation-based enhancement via message propagation over user-recipe interactions, and (3) learning-based enhancement through contrastive learning with learnable embeddings. Experiments on two real-world datasets show that TESMR outperforms existing methods, achieving 7-15% higher Recall@10.
☆ RAVEN++: Pinpointing Fine-Grained Violations in Advertisement Videos with Active Reinforcement Reasoning EMNLP 2025
Advertising (Ad) is a cornerstone of the digital economy, yet the moderation of video advertisements remains a significant challenge due to their complexity and the need for precise violation localization. While recent advancements, such as the RAVEN model, have improved coarse-grained violation detection, critical gaps persist in fine-grained understanding, explainability, and generalization. To address these limitations, we propose RAVEN++, a novel framework that introduces three key innovations: 1) Active Reinforcement Learning (RL), which dynamically adapts training to samples of varying difficulty; 2) Fine-Grained Violation Understanding, achieved through hierarchical reward functions and reasoning distillation; and 3) Progressive Multi-Stage Training, which systematically combines knowledge injection, curriculum-based passive RL, and active RL. Extensive experiments on both public and proprietary datasets, on both offline scenarios and online deployed A/B Testing, demonstrate that RAVEN++ outperforms general-purpose LLMs and specialized models like RAVEN in terms of fine-grained violation understanding, reasoning capabilities, and generalization ability.
comment: EMNLP 2025 (Oral, Industry Track)
☆ First-order Sobolev Reinforcement Learning
We propose a refinement of temporal-difference learning that enforces first-order Bellman consistency: the learned value function is trained to match not only the Bellman targets in value but also their derivatives with respect to states and actions. By differentiating the Bellman backup through differentiable dynamics, we obtain analytically consistent gradient targets. Incorporating these into the critic objective using a Sobolev-type loss encourages the critic to align with both the value and local geometry of the target function. This first-order TD matching principle can be seamlessly integrated into existing algorithms, such as Q-learning or actor-critic methods (e.g., DDPG, SAC), potentially leading to faster critic convergence and more stable policy gradients without altering their overall structure.
comment: Workshop paper at Differentiable Systems and Scientific Machine Learning, EurIPS 2025
☆ A Robust State Filter Against Unmodeled Process And Measurement Noise
This paper introduces a novel Kalman filter framework designed to achieve robust state estimation under both process and measurement noise. Inspired by the Weighted Observation Likelihood Filter (WoLF), which provides robustness against measurement outliers, we applied generalized Bayesian approach to build a framework considering both process and measurement noise outliers.
☆ Masked Diffusion Models are Secretly Learned-Order Autoregressive Models
Masked Diffusion Models (MDMs) have emerged as one of the most promising paradigms for generative modeling over discrete domains. It is known that MDMs effectively train to decode tokens in a random order, and that this ordering has significant performance implications in practice. This observation raises a fundamental question: can we design a training framework that optimizes for a favorable decoding order? We answer this in the affirmative, showing that the continuous-time variational objective of MDMs, when equipped with multivariate noise schedules, can identify and optimize for a decoding order during training. We establish a direct correspondence between decoding order and the multivariate noise schedule and show that this setting breaks invariance of the MDM objective to the noise schedule. Furthermore, we prove that the MDM objective decomposes precisely into a weighted auto-regressive losses over these orders, which establishes them as auto-regressive models with learnable orders.
comment: Accepted at EurIPS 2025 Workshop on Principles of Generative Modeling (PriGM)
☆ Feature Ranking in Credit-Risk with Qudit-Based Networks
In finance, predictive models must balance accuracy and interpretability, particularly in credit risk assessment, where model decisions carry material consequences. We present a quantum neural network (QNN) based on a single qudit, in which both data features and trainable parameters are co-encoded within a unified unitary evolution generated by the full Lie algebra. This design explores the entire Hilbert space while enabling interpretability through the magnitudes of the learned coefficients. We benchmark our model on a real-world, imbalanced credit-risk dataset from Taiwan. The proposed QNN consistently outperforms LR and reaches the results of random forest models in macro-F1 score while preserving a transparent correspondence between learned parameters and input feature importance. To quantify the interpretability of the proposed model, we introduce two complementary metrics: (i) the edit distance between the model's feature ranking and that of LR, and (ii) a feature-poisoning test where selected features are replaced with noise. Results indicate that the proposed quantum model achieves competitive performance while offering a tractable path toward interpretable quantum learning.
☆ Collaborative Learning with Multiple Foundation Models for Source-Free Domain Adaptation
Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to an unlabeled target domain without access to source data. Recent advances in Foundation Models (FMs) have introduced new opportunities for leveraging external semantic knowledge to guide SFDA. However, relying on a single FM is often insufficient, as it tends to bias adaptation toward a restricted semantic coverage, failing to capture diverse contextual cues under domain shift. To overcome this limitation, we propose a Collaborative Multi-foundation Adaptation (CoMA) framework that jointly leverages two different FMs (e.g., CLIP and BLIP) with complementary properties to capture both global semantics and local contextual cues. Specifically, we employ a bidirectional adaptation mechanism that (1) aligns different FMs with the target model for task adaptation while maintaining their semantic distinctiveness, and (2) transfers complementary knowledge from the FMs to the target model. To ensure stable adaptation under mini-batch training, we introduce Decomposed Mutual Information (DMI) that selectively enhances true dependencies while suppressing false dependencies arising from incomplete class coverage. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art SFDA methods across four benchmarks, including Office-31, Office-Home, DomainNet-126, and VisDA, under the closed-set setting, while also achieving best results on partial-set and open-set variants.
comment: 15 pages, 8 figures
☆ Uncertainty-Aware Deep Learning Framework for Remaining Useful Life Prediction in Turbofan Engines with Learned Aleatoric Uncertainty
Accurate Remaining Useful Life (RUL) prediction coupled with uncertainty quantification remains a critical challenge in aerospace prognostics. This research introduces a novel uncertainty-aware deep learning framework that learns aleatoric uncertainty directly through probabilistic modeling, an approach unexplored in existing CMAPSS-based literature. Our hierarchical architecture integrates multi-scale Inception blocks for temporal pattern extraction, bidirectional Long Short-Term Memory networks for sequential modeling, and a dual-level attention mechanism operating simultaneously on sensor and temporal dimensions. The innovation lies in the Bayesian output layer that predicts both mean RUL and variance, enabling the model to learn data-inherent uncertainty. Comprehensive preprocessing employs condition-aware clustering, wavelet denoising, and intelligent feature selection. Experimental validation on NASA CMAPSS benchmarks (FD001-FD004) demonstrates competitive overall performance with RMSE values of 16.22, 19.29, 16.84, and 19.98 respectively. Remarkably, our framework achieves breakthrough critical zone performance (RUL <= 30 cycles) with RMSE of 5.14, 6.89, 5.27, and 7.16, representing 25-40 percent improvements over conventional approaches and establishing new benchmarks for safety-critical predictions. The learned uncertainty provides well-calibrated 95 percent confidence intervals with coverage ranging from 93.5 percent to 95.2 percent, enabling risk-aware maintenance scheduling previously unattainable in CMAPSS literature.
comment: 10 pages, 2 figures, 3 tables. Submitted to arXiv
☆ The Core in Max-Loss Non-Centroid Clustering Can Be Empty
We study core stability in non-centroid clustering under the max-loss objective, where each agent's loss is the maximum distance to other members of their cluster. We prove that for all $k\geq 3$ there exist metric instances with $n\ge 9$ agents, with $n$ divisible by $k$, for which no clustering lies in the $α$-core for any $α<2^{\frac{1}{5}}\sim 1.148$. The bound is tight for our construction. Using a computer-aided proof, we also identify a two-dimensional Euclidean point set whose associated lower bound is slightly smaller than that of our general construction. This is, to our knowledge, the first impossibility result showing that the core can be empty in non-centroid clustering under the max-loss objective.
☆ Edge-Based Predictive Data Reduction for Smart Agriculture: A Lightweight Approach to Efficient IoT Communication
The rapid growth of IoT devices has led to an enormous amount of sensor data that requires transmission to cloud servers for processing, resulting in excessive network congestion, increased latency and high energy consumption. This is particularly problematic in resource-constrained and remote environments where bandwidth is limited, and battery-dependent devices further emphasize the problem. Moreover, in domains such as agriculture, consecutive sensor readings often have minimal variation, making continuous data transmission inefficient and unnecessarily resource intensive. To overcome these challenges, we propose an analytical prediction algorithm designed for edge computing environments and validated through simulation. The proposed solution utilizes a predictive filter at the network edge that forecasts the next sensor data point and triggers data transmission only when the deviation from the predicted value exceeds a predefined tolerance. A complementary cloud-based model ensures data integrity and overall system consistency. This dual-model strategy effectively reduces communication overhead and demonstrates potential for improving energy efficiency by minimizing redundant transmissions. In addition to reducing communication load, our approach leverages both in situ and satellite observations from the same locations to enhance model robustness. It also supports cross-site generalization, enabling models trained in one region to be effectively deployed elsewhere without retraining. This makes our solution highly scalable, energy-aware, and well-suited for optimizing sensor data transmission in remote and bandwidth-constrained IoT environments.
comment: Accepted for presentation and publication in the proceedings of the IEEE Annual Congress on Artificial Intelligence of Things (IEEE AIoT 2025)
☆ Extracting Robust Register Automata from Neural Networks over Data Sequences
Automata extraction is a method for synthesising interpretable surrogates for black-box neural models that can be analysed symbolically. Existing techniques assume a finite input alphabet, and thus are not directly applicable to data sequences drawn from continuous domains. We address this challenge with deterministic register automata (DRAs), which extend finite automata with registers that store and compare numeric values. Our main contribution is a framework for robust DRA extraction from black-box models: we develop a polynomial-time robustness checker for DRAs with a fixed number of registers, and combine it with passive and active automata learning algorithms. This combination yields surrogate DRAs with statistical robustness and equivalence guarantees. As a key application, we use the extracted automata to assess the robustness of neural networks: for a given sequence and distance metric, the DRA either certifies local robustness or produces a concrete counterexample. Experiments on recurrent neural networks and transformer architectures show that our framework reliably learns accurate automata and enables principled robustness evaluation. Overall, our results demonstrate that robust DRA extraction effectively bridges neural network interpretability and formal reasoning without requiring white-box access to the underlying network.
☆ Optimization of Deep Learning Models for Dynamic Market Behavior Prediction
The advent of financial technology has witnessed a surge in the utilization of deep learning models to anticipate consumer conduct, a trend that has demonstrated considerable potential in enhancing lending strategies and bolstering market efficiency. We study multi-horizon demand forecasting on e-commerce transactions using the UCI Online Retail II dataset. Unlike prior versions of this manuscript that mixed financial-loan narratives with retail data, we focus exclusively on retail market behavior and define a clear prediction target: per SKU daily demand (or revenue) for horizons H=1,7,14. We present a hybrid sequence model that combines multi-scale temporal convolutions, a gated recurrent module, and time-aware self-attention. The model is trained with standard regression losses and evaluated under MAE, RMSE, sMAPE, MASE, and Theil's U_2 with strict time-based splits to prevent leakage. We benchmark against ARIMA/Prophet, LSTM/GRU, LightGBM, and state-of-the-art Transformer forecasters (TFT, Informer, Autoformer, N-BEATS). Results show consistent accuracy gains and improved robustness on peak/holiday periods. We further provide ablations and statistical significance tests to ensure the reliability of improvements, and we release implementation details to facilitate reproducibility.
☆ EnfoPath: Energy-Informed Analysis of Generative Trajectories in Flow Matching
Flow-based generative models synthesize data by integrating a learned velocity field from a reference distribution to the target data distribution. Prior work has focused on endpoint metrics (e.g., fidelity, likelihood, perceptual quality) while overlooking a deeper question: what do the sampling trajectories reveal? Motivated by classical mechanics, we introduce kinetic path energy (KPE), a simple yet powerful diagnostic that quantifies the total kinetic effort along each generation path of ODE-based samplers. Through comprehensive experiments on CIFAR-10 and ImageNet-256, we uncover two key phenomena: ({i}) higher KPE predicts stronger semantic quality, indicating that semantically richer samples require greater kinetic effort, and ({ii}) higher KPE inversely correlates with data density, with informative samples residing in sparse, low-density regions. Together, these findings reveal that semantically informative samples naturally reside on the sparse frontier of the data distribution, demanding greater generative effort. Our results suggest that trajectory-level analysis offers a physics-inspired and interpretable framework for understanding generation difficulty and sample characteristics.
comment: EurIPS 2025 Workshop on Principles of Generative Modeling (PriGM)
☆ Structured Matching via Cost-Regularized Unbalanced Optimal Transport
Unbalanced optimal transport (UOT) provides a flexible way to match or compare nonnegative finite Radon measures. However, UOT requires a predefined ground transport cost, which may misrepresent the data's underlying geometry. Choosing such a cost is particularly challenging when datasets live in heterogeneous spaces, often motivating practitioners to adopt Gromov-Wasserstein formulations. To address this challenge, we introduce cost-regularized unbalanced optimal transport (CR-UOT), a framework that allows the ground cost to vary while allowing mass creation and removal. We show that CR-UOT incorporates unbalanced Gromov-Wasserstein type problems through families of inner-product costs parameterized by linear transformations, enabling the matching of measures or point clouds across Euclidean spaces. We develop algorithms for such CR-UOT problems using entropic regularization and demonstrate that this approach improves the alignment of heterogeneous single-cell omics profiles, especially when many cells lack direct matches.
☆ DynaMix: Generalizable Person Re-identification via Dynamic Relabeling and Mixed Data Sampling
Generalizable person re-identification (Re-ID) aims to recognize individuals across unseen cameras and environments. While existing methods rely heavily on limited labeled multi-camera data, we propose DynaMix, a novel method that effectively combines manually labeled multi-camera and large-scale pseudo-labeled single-camera data. Unlike prior works, DynaMix dynamically adapts to the structure and noise of the training data through three core components: (1) a Relabeling Module that refines pseudo-labels of single-camera identities on-the-fly; (2) an Efficient Centroids Module that maintains robust identity representations under a large identity space; and (3) a Data Sampling Module that carefully composes mixed data mini-batches to balance learning complexity and intra-batch diversity. All components are specifically designed to operate efficiently at scale, enabling effective training on millions of images and hundreds of thousands of identities. Extensive experiments demonstrate that DynaMix consistently outperforms state-of-the-art methods in generalizable person Re-ID.
☆ Mitigating Participation Imbalance Bias in Asynchronous Federated Learning
In Asynchronous Federated Learning (AFL), the central server immediately updates the global model with each arriving client's contribution. As a result, clients perform their local training on different model versions, causing information staleness (delay). In federated environments with non-IID local data distributions, this asynchronous pattern amplifies the adverse effect of client heterogeneity (due to different data distribution, local objectives, etc.), as faster clients contribute more frequent updates, biasing the global model. We term this phenomenon heterogeneity amplification. Our work provides a theoretical analysis that maps AFL design choices to their resulting error sources when heterogeneity amplification occurs. Guided by our analysis, we propose ACE (All-Client Engagement AFL), which mitigates participation imbalance through immediate, non-buffered updates that use the latest information available from all clients. We also introduce a delay-aware variant, ACED, to balance client diversity against update staleness. Experiments on different models for different tasks across diverse heterogeneity and delay settings validate our analysis and demonstrate the robust performance of our approaches.
☆ Understanding, Accelerating, and Improving MeanFlow Training
MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
☆ Resolving Node Identifiability in Graph Neural Processes via Laplacian Spectral Encodings
Message passing graph neural networks are widely used for learning on graphs, yet their expressive power is limited by the one-dimensional Weisfeiler-Lehman test and can fail to distinguish structurally different nodes. We provide rigorous theory for a Laplacian positional encoding that is invariant to eigenvector sign flips and to basis rotations within eigenspaces. We prove that this encoding yields node identifiability from a constant number of observations and establishes a sample-complexity separation from architectures constrained by the Weisfeiler-Lehman test. The analysis combines a monotone link between shortest-path and diffusion distance, spectral trilateration with a constant set of anchors, and quantitative spectral injectivity with logarithmic embedding size. As an instantiation, pairing this encoding with a neural-process style decoder yields significant gains on a drug-drug interaction task on chemical graphs, improving both the area under the ROC curve and the F1 score and demonstrating the practical benefits of resolving theoretical expressiveness limitations with principled positional information.
☆ OrdMoE: Preference Alignment via Hierarchical Expert Group Ranking in Multimodal Mixture-of-Experts LLMs
Preference learning has recently emerged as a pivotal strategy for post-training alignment of Multimodal Large Language Models (MLLMs). However, existing approaches predominantly rely on external human-annotated preference data, which is costly and labor-intensive to collect. In this work, we propose OrdMoE, a novel preference alignment framework that bypasses the reliance on external human preferences entirely by leveraging intrinsic signals within Mixture-of-Experts (MoE) architectures. Specifically, we observe that the router's expert selection scores implicitly encode a quality-aware ranking of responses (i.e. higher-scoring experts consistently generate higher-quality outputs). Building on this insight, OrdMoE constructs an internal preference hierarchy by grouping experts into ranked tiers based on their per-token routing scores and activating each tier separately to produce a sequence of responses with increasing quality. This yields a zero-cost, self-supervised preference ordering over generated responses, which can be directly optimized using standard preference learning objectives. Extensive experiments across multiple multimodal benchmarks demnstrate that OrdMoE significantly enhances both alignment and overall performance of multimodal Mixture-of-Experts LLMs, achieving competitive results without requiring any human-annotated preference data.
☆ 3D Dynamic Radio Map Prediction Using Vision Transformers for Low-Altitude Wireless Networks
Low-altitude wireless networks (LAWN) are rapidly expanding with the growing deployment of unmanned aerial vehicles (UAVs) for logistics, surveillance, and emergency response. Reliable connectivity remains a critical yet challenging task due to three-dimensional (3D) mobility, time-varying user density, and limited power budgets. The transmit power of base stations (BSs) fluctuates dynamically according to user locations and traffic demands, leading to a highly non-stationary 3D radio environment. Radio maps (RMs) have emerged as an effective means to characterize spatial power distributions and support radio-aware network optimization. However, most existing works construct static or offline RMs, overlooking real-time power variations and spatio-temporal dependencies in multi-UAV networks. To overcome this limitation, we propose a {3D dynamic radio map (3D-DRM)} framework that learns and predicts the spatio-temporal evolution of received power. Specially, a Vision Transformer (ViT) encoder extracts high-dimensional spatial representations from 3D RMs, while a Transformer-based module models sequential dependencies to predict future power distributions. Experiments unveil that 3D-DRM accurately captures fast-varying power dynamics and substantially outperforms baseline models in both RM reconstruction and short-term prediction.
comment: 7 pages, 4 figures, submitted to IEEE ICC 2026
☆ Classification EM-PCA for clustering and embedding
The mixture model is undoubtedly one of the greatest contributions to clustering. For continuous data, Gaussian models are often used and the Expectation-Maximization (EM) algorithm is particularly suitable for estimating parameters from which clustering is inferred. If these models are particularly popular in various domains including image clustering, they however suffer from the dimensionality and also from the slowness of convergence of the EM algorithm. However, the Classification EM (CEM) algorithm, a classifying version, offers a fast convergence solution while dimensionality reduction still remains a challenge. Thus we propose in this paper an algorithm combining simultaneously and non-sequentially the two tasks --Data embedding and Clustering-- relying on Principal Component Analysis (PCA) and CEM. We demonstrate the interest of such approach in terms of clustering and data embedding. We also establish different connections with other clustering approaches.
comment: Accepted at the IEEE conference on Big Data (Special Session on Machine Learning)
☆ Dynamic Mixture of Experts Against Severe Distribution Shifts
The challenge of building neural networks that can continuously learn and adapt to evolving data streams is central to the fields of continual learning (CL) and reinforcement learning (RL). This lifelong learning problem is often framed in terms of the plasticity-stability dilemma, focusing on issues like loss of plasticity and catastrophic forgetting. Unlike neural networks, biological brains maintain plasticity through capacity growth, inspiring researchers to explore similar approaches in artificial networks, such as adding capacity dynamically. Prior solutions often lack parameter efficiency or depend on explicit task indices, but Mixture-of-Experts (MoE) architectures offer a promising alternative by specializing experts for distinct distributions. This paper aims to evaluate a DynamicMoE approach for continual and reinforcement learning environments and benchmark its effectiveness against existing network expansion methods.
☆ FastForward Pruning: Efficient LLM Pruning via Single-Step Reinforcement Learning
Pruning is an effective method for compressing Large Language Models, but finding an optimal, non-uniform layer-wise sparsity allocation remains a key challenge. While heuristic methods are fast but yield suboptimal performance, more powerful search-based approaches like Reinforcement Learning are often hindered by prohibitive computational costs on large-scale models. To overcome this efficiency barrier, we propose FastForward Pruning. Its core is a decoupled, single-step RL framework that separates policy optimization from the complex budget satisfaction problem. Such a decoupling is crucial for efficiently searching the vast policy space of LLMs. This curriculum-based strategy begins with low-cost, simple tasks and gradually increases in complexity, significantly reducing the search's computational overhead. Evaluated on the LLaMA, Mistral, and OPT model families, our framework discovers pruning policies that achieve superior performance over strong heuristic baselines. Crucially, when compared to other search-based algorithms, our method achieves competitive or superior results at a fraction of the computational cost, demonstrating a clear advantage in search efficiency.
comment: 5 pages, 2 figures, 4 tables
☆ AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). However, this history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage the context of history. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP that the action generation should be conditioned on the belief state. AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, which is a neural approximation of the agent's belief state derived from the previous decision step. Specifically, the AVA module uses the recurrent state to compute the soft weights to actively process task-relevant visual tokens based on its historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework's practical applicability and robust sim-to-real transferability.
comment: 18 pages, 10 figures
☆ Learning to Compress Graphs via Dual Agents for Consistent Topological Robustness Evaluation
As graph-structured data grow increasingly large, evaluating their robustness under adversarial attacks becomes computationally expensive and difficult to scale. To address this challenge, we propose to compress graphs into compact representations that preserve both topological structure and robustness profile, enabling efficient and reliable evaluation.We propose Cutter, a dual-agent reinforcement learning framework composed of a Vital Detection Agent (VDA) and a Redundancy Detection Agent (RDA), which collaboratively identify structurally vital and redundant nodes for guided compression. Cutter incorporates three key strategies to enhance learning efficiency and compression quality: trajectory-level reward shaping to transform sparse trajectory returns into dense, policy-equivalent learning signals; prototype-based shaping to guide decisions using behavioral patterns from both highand low-return trajectories; and cross-agent imitation to enable safer and more transferable exploration. Experiments on multiple real-world graphs demonstrate that Cutter generates compressed graphs that retain essential static topological properties and exhibit robustness degradation trends highly consistent with the original graphs under various attack scenarios, thereby significantly improving evaluation efficiency without compromising assessment fidelity.
☆ Compressor-VLA: Instruction-Guided Visual Token Compression for Efficient Robotic Manipulation
Vision-Language-Action (VLA) models have emerged as a powerful paradigm in Embodied AI. However, the significant computational overhead of processing redundant visual tokens remains a critical bottleneck for real-time robotic deployment. While standard token pruning techniques can alleviate this, these task-agnostic methods struggle to preserve task-critical visual information. To address this challenge, simultaneously preserving both the holistic context and fine-grained details for precise action, we propose Compressor-VLA, a novel hybrid instruction-conditioned token compression framework designed for efficient, task-oriented compression of visual information in VLA models. The proposed Compressor-VLA framework consists of two token compression modules: a Semantic Task Compressor (STC) that distills holistic, task-relevant context, and a Spatial Refinement Compressor (SRC) that preserves fine-grained spatial details. This compression is dynamically modulated by the natural language instruction, allowing for the adaptive condensation of task-relevant visual information. Experimentally, extensive evaluations demonstrate that Compressor-VLA achieves a competitive success rate on the LIBERO benchmark while reducing FLOPs by 59% and the visual token count by over 3x compared to its baseline. The real-robot deployments on a dual-arm robot platform validate the model's sim-to-real transferability and practical applicability. Moreover, qualitative analyses reveal that our instruction guidance dynamically steers the model's perceptual focus toward task-relevant objects, thereby validating the effectiveness of our approach.
comment: 11 pages, 5 figures
☆ MIST: Mutual Information Via Supervised Training
We propose a fully data-driven approach to designing mutual information (MI) estimators. Since any MI estimator is a function of the observed sample from two random variables, we parameterize this function with a neural network (MIST) and train it end-to-end to predict MI values. Training is performed on a large meta-dataset of 625,000 synthetic joint distributions with known ground-truth MI. To handle variable sample sizes and dimensions, we employ a two-dimensional attention scheme ensuring permutation invariance across input samples. To quantify uncertainty, we optimize a quantile regression loss, enabling the estimator to approximate the sampling distribution of MI rather than return a single point estimate. This research program departs from prior work by taking a fully empirical route, trading universal theoretical guarantees for flexibility and efficiency. Empirically, the learned estimators largely outperform classical baselines across sample sizes and dimensions, including on joint distributions unseen during training. The resulting quantile-based intervals are well-calibrated and more reliable than bootstrap-based confidence intervals, while inference is orders of magnitude faster than existing neural baselines. Beyond immediate empirical gains, this framework yields trainable, fully differentiable estimators that can be embedded into larger learning pipelines. Moreover, exploiting MI's invariance to invertible transformations, meta-datasets can be adapted to arbitrary data modalities via normalizing flows, enabling flexible training for diverse target meta-distributions.
☆ Geometry-Aware Deep Congruence Networks for Manifold Learning in Cross-Subject Motor Imagery
Cross-subject motor-imagery decoding remains a major challenge in EEG-based brain-computer interfaces due to strong subject variability and the curved geometry of covariance matrices on the symmetric positive definite (SPD) manifold. We address the zero-shot cross-subject setting, where no target-subject labels or adaptation are allowed, by introducing novel geometry-aware preprocessing modules and deep congruence networks that operate directly on SPD covariance matrices. Our preprocessing modules, DCR and RiFU, extend Riemannian Alignment by improving action separation while reducing subject-specific distortions. We further propose two manifold classifiers, SPD-DCNet and RiFUNet, which use hierarchical congruence transforms to learn discriminative, subject-invariant covariance representations. On the BCI-IV 2a benchmark, our framework improves cross-subject accuracy by 3-4% over the strongest classical baselines, demonstrating the value of geometry-aware transformations for robust EEG decoding.
comment: 10 pages, 2 figures
☆ SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
Large Language Models (LLMs) face a significant bottleneck during autoregressive inference due to the massive memory footprint of the Key-Value (KV) cache. Existing compression techniques like token eviction, quantization, or other low-rank methods often risk information loss, have fixed limits, or introduce significant computational overhead from explicit decompression steps. In this work, we introduce SWAN, a novel, fine-tuning-free framework that eliminates this overhead. Our method uses an offline orthogonal matrix to rotate and prune the KV-cache, which is then used directly in the attention computation without any reconstruction. Our extensive experiments demonstrate that SWAN, augmented with a small dense buffer, offers a robust trade-off, maintaining performance close to the uncompressed baseline even at aggressive 50-60% memory savings per-token on KV-cache. A key advantage is its runtime-tunable compression level, allowing operators to dynamically adjust the memory footprint, a flexibility absent in methods requiring fixed offline configurations. This combination of a decompression-free design, high performance under compression, and adaptability makes SWAN a practical and efficient solution for serving LLMs with long contexts.
☆ Learning Solution Operators for Partial Differential Equations via Monte Carlo-Type Approximation NeurIPS 2025
The Monte Carlo-type Neural Operator (MCNO) introduces a lightweight architecture for learning solution operators for parametric PDEs by directly approximating the kernel integral using a Monte Carlo approach. Unlike Fourier Neural Operators, MCNO makes no spectral or translation-invariance assumptions. The kernel is represented as a learnable tensor over a fixed set of randomly sampled points. This design enables generalization across multiple grid resolutions without relying on fixed global basis functions or repeated sampling during training. Experiments on standard 1D PDE benchmarks show that MCNO achieves competitive accuracy with low computational cost, providing a simple and practical alternative to spectral and graph-based neural operators.
comment: NeurIPS 2025 Workshop on Machine Learning and the Physical Sciences
☆ How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
☆ VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal RL
Group-based policy optimization methods like GRPO and GSPO have become standard for training multimodal models, leveraging group-wise rollouts and relative advantage estimation. However, they suffer from a critical \emph{gradient vanishing} problem when all responses within a group receive identical rewards, causing advantage estimates to collapse and training signals to diminish. Existing attempts to mitigate this issue fall into two paradigms: filtering-based and sampling-based methods. Filtering-based methods first generate rollouts broadly and then retroactively filter out uninformative groups, leading to substantial computational overhead. Sampling-based methods proactively select effective samples before rollout but rely on static criteria or prior dataset knowledge, lacking real-time adaptability. To address these issues, we propose \textbf{VADE}, a \textbf{V}ariance-\textbf{A}ware \textbf{D}ynamic sampling framework via online sample-level difficulty \textbf{E}stimation. Our framework integrates three key components: online sample-level difficulty estimation using Beta distributions, a Thompson sampler that maximizes information gain through the estimated correctness probability, and a two-scale prior decay mechanism that maintains robust estimation under policy evolution. This three components design enables VADE to dynamically select the most informative samples, thereby amplifying training signals while eliminating extra rollout costs. Extensive experiments on multimodal reasoning benchmarks show that VADE consistently outperforms strong baselines in both performance and sample efficiency, while achieving a dramatic reduction in computational overhead. More importantly, our framework can serves as a plug-and-play component to be seamlessly integrated into existing group-based RL algorithms. Code and models are available at https://VADE-RL.github.io.
☆ Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models NeurIPS 2025
Efficient deployment of small language models (SLMs) is essential for numerous real-world applications with stringent latency constraints. While previous work on SLM design has primarily focused on reducing the number of parameters to achieve parameter-optimal SLMs, parameter efficiency does not necessarily translate into proportional real-device speed-ups. This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training when real-device latency is the primary consideration. Specifically, we identify two central architectural factors: depth-width ratios and operator choices. The former is crucial for small-batch-size latency, while the latter affects both latency and large-batch-size throughput. In light of this, we first study latency-optimal depth-width ratios, with the key finding that although deep-thin models generally achieve better accuracy under the same parameter budget, they may not lie on the accuracy-latency trade-off frontier. Next, we explore emerging efficient attention alternatives to evaluate their potential as candidate building operators. Using the identified promising operators, we construct an evolutionary search framework to automatically discover latency-optimal combinations of these operators within hybrid SLMs, thereby advancing the accuracy-latency frontier. In addition to architectural improvements, we further enhance SLM training using a weight normalization technique that enables more effective weight updates and improves final convergence. Combining these methods, we introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs, e.g., achieving over +5.5% average accuracy, 1.3x/1.9x lower latency, and 18.7x/45.6x higher throughput compared to Qwen3-1.7B/0.6B, respectively.
comment: Accepted by NeurIPS 2025
☆ Hi-SAFE: Hierarchical Secure Aggregation for Lightweight Federated Learning
Federated learning (FL) faces challenges in ensuring both privacy and communication efficiency, particularly in resource-constrained environments such as Internet of Things (IoT) and edge networks. While sign-based methods, such as sign stochastic gradient descent with majority voting (SIGNSGD-MV), offer substantial bandwidth savings, they remain vulnerable to inference attacks due to exposure of gradient signs. Existing secure aggregation techniques are either incompatible with sign-based methods or incur prohibitive overhead. To address these limitations, we propose Hi-SAFE, a lightweight and cryptographically secure aggregation framework for sign-based FL. Our core contribution is the construction of efficient majority vote polynomials for SIGNSGD-MV, derived from Fermat's Little Theorem. This formulation represents the majority vote as a low-degree polynomial over a finite field, enabling secure evaluation that hides intermediate values and reveals only the final result. We further introduce a hierarchical subgrouping strategy that ensures constant multiplicative depth and bounded per-user complexity, independent of the number of users n.
comment: currently submitted and awaiting review at the IEEE Internet of Things Journal
☆ Fairness Meets Privacy: Integrating Differential Privacy and Demographic Parity in Multi-class Classification
The increasing use of machine learning in sensitive applications demands algorithms that simultaneously preserve data privacy and ensure fairness across potentially sensitive sub-populations. While privacy and fairness have each been extensively studied, their joint treatment remains poorly understood. Existing research often frames them as conflicting objectives, with multiple studies suggesting that strong privacy notions such as differential privacy inevitably compromise fairness. In this work, we challenge that perspective by showing that differential privacy can be integrated into a fairness-enhancing pipeline with minimal impact on fairness guarantees. We design a postprocessing algorithm, called DP2DP, that enforces both demographic parity and differential privacy. Our analysis reveals that our algorithm converges towards its demographic parity objective at essentially the same rate (up logarithmic factor) as the best non-private methods from the literature. Experiments on both synthetic and real datasets confirm our theoretical results, showing that the proposed algorithm achieves state-of-the-art accuracy/fairness/privacy trade-offs.
☆ GContextFormer: A global context-aware hybrid multi-head attention approach with scaled additive aggregation for multimodal trajectory prediction
Multimodal trajectory prediction generates multiple plausible future trajectories to address vehicle motion uncertainty from intention ambiguity and execution variability. However, HD map-dependent models suffer from costly data acquisition, delayed updates, and vulnerability to corrupted inputs, causing prediction failures. Map-free approaches lack global context, with pairwise attention over-amplifying straight patterns while suppressing transitional patterns, resulting in motion-intention misalignment. This paper proposes GContextFormer, a plug-and-play encoder-decoder architecture with global context-aware hybrid attention and scaled additive aggregation achieving intention-aligned multimodal prediction without map reliance. The Motion-Aware Encoder builds scene-level intention prior via bounded scaled additive aggregation over mode-embedded trajectory tokens and refines per-mode representations under shared global context, mitigating inter-mode suppression and promoting intention alignment. The Hierarchical Interaction Decoder decomposes social reasoning into dual-pathway cross-attention: a standard pathway ensures uniform geometric coverage over agent-mode pairs while a neighbor-context-enhanced pathway emphasizes salient interactions, with gating module mediating their contributions to maintain coverage-focus balance. Experiments on eight highway-ramp scenarios from TOD-VT dataset show GContextFormer outperforms state-of-the-art baselines. Compared to existing transformer models, GContextFormer achieves greater robustness and concentrated improvements in high-curvature and transition zones via spatial distributions. Interpretability is achieved through motion mode distinctions and neighbor context modulation exposing reasoning attribution. The modular architecture supports extensibility toward cross-domain multimodal reasoning tasks. Source: https://fenghy-chen.github.io/sources/.
☆ Periodic Asynchrony: An Effective Method for Accelerating On-Policy Reinforcement Learning
Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution imposes a computational coupling that prevents concurrent inference and training. In this study, we are returning to the strategy of separating inference and training deployment, and by introducing improvements in the data loader, we transform the conventional synchronous architecture into a periodically asynchronous framework, which allows for demand-driven, independent, and elastic scaling of each component, while the accuracy of the algorithm remains completely equivalent to the synchronization method, with both belonging to the on-policy strategy. It is worth emphasizing that we apply a unified tri-model architecture in the training phase, and we also proposed a shared-prompt attention mask to reduce repetitive computation. In practice, these works have achieved at least a threefold overall performance improvement in RL training on NPU platforms, indicating its potential for widespread application.
☆ KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit
High quality kernels are critical for reducing training and inference costs of Large Language Models (LLMs), yet they traditionally require significant expertise in hardware architecture and software optimization. While recent advances in LLM-based code generation show promise for complex optimization, existing methods struggle with the vast optimization space due to insufficient hardware domain knowledge, failing to effectively balance exploration and exploitation. We present KernelBand, a novel framework that formulates kernel optimization as a hierarchical multi-armed bandit problem, enabling LLM agents to strategically navigate the optimization space by treating kernel selection and optimization strategy application as sequential decision-making processes. Our approach leverages hardware profiling information to identify promising optimization strategies and employs runtime behavior clustering to reduce exploration overhead across kernel candidates. Extensive experiments on TritonBench demonstrate that KernelBand significantly outperforms state-of-the-art methods, achieving superior performance with fewer tokens while exhibiting consistent improvement without saturation as computational resources increase.
comment: Work in progress
☆ Robust and Generalizable GNN Fine-Tuning via Uncertainty-aware Adapter Learning
Recently, fine-tuning large-scale pre-trained GNNs has yielded remarkable attention in adapting pre-trained GNN models for downstream graph learning tasks. One representative fine-tuning method is to exploit adapter (termed AdapterGNN) which aims to 'augment' the pre-trained model by inserting a lightweight module to make the 'augmented' model better adapt to the downstream tasks. However, graph data may contain various types of noise in downstream tasks, such as noisy edges and ambiguous node attributes. Existing AdapterGNNs are often prone to graph noise and exhibit limited generalizability. How to enhance the robustness and generalization ability of GNNs' fine tuning remains an open problem. In this paper, we show that the above problem can be well addressed by integrating uncertainty learning into the GNN adapter. We propose the Uncertainty-aware Adapter (UAdapterGNN) that fortifies pre-trained GNN models against noisy graph data in the fine-tuning process. Specifically, in contrast to regular AdapterGNN, our UAdapterGNN exploits Gaussian probabilistic adapter to augment the pre-trained GNN model. In this way, when the graph contains various noises,our method can automatically absorb the effects of changes in the variances of the Gaussian distribution, thereby significantly enhancing the model's robustness. Also, UAdapterGNN can further improve the generalization ability of the model on the downstream tasks. Extensive experiments on several benchmarks demonstrate the effectiveness, robustness and high generalization ability of the proposed UAdapterGNN method.
☆ WaveTuner: Comprehensive Wavelet Subband Tuning for Time Series Forecasting
Due to the inherent complexity, temporal patterns in real-world time series often evolve across multiple intertwined scales, including long-term periodicity, short-term fluctuations, and abrupt regime shifts. While existing literature has designed many sophisticated decomposition approaches based on the time or frequency domain to partition trend-seasonality components and high-low frequency components, an alternative line of approaches based on the wavelet domain has been proposed to provide a unified multi-resolution representation with precise time-frequency localization. However, most wavelet-based methods suffer from a persistent bias toward recursively decomposing only low-frequency components, severely underutilizing subtle yet informative high-frequency components that are pivotal for precise time series forecasting. To address this problem, we propose WaveTuner, a Wavelet decomposition framework empowered by full-spectrum subband Tuning for time series forecasting. Concretely, WaveTuner comprises two key modules: (i) Adaptive Wavelet Refinement module, that transforms time series into time-frequency coefficients, utilizes an adaptive router to dynamically assign subband weights, and generates subband-specific embeddings to support refinement; and (ii) Multi-Branch Specialization module, that employs multiple functional branches, each instantiated as a flexible Kolmogorov-Arnold Network (KAN) with a distinct functional order to model a specific spectral subband. Equipped with these modules, WaveTuner comprehensively tunes global trends and local variations within a unified time-frequency framework. Extensive experiments on eight real-world datasets demonstrate WaveTuner achieves state-of-the-art forecasting performance in time series forecasting.
☆ A Reproducible Framework for Neural Topic Modeling in Focus Group Analysis
Focus group discussions generate rich qualitative data but their analysis traditionally relies on labor-intensive manual coding that limits scalability and reproducibility. We present a rigorous, reproducible computational framework for applying neural topic modeling to focus group transcripts, addressing fundamental methodological challenges: hyperparameter sensitivity, model stability, and validation of interpretability. Using BERTopic applied to ten focus groups exploring HPV vaccine perceptions in Tunisia (1,076 utterances), we conducted systematic evaluation across 27 hyperparameter configurations, assessed stability through bootstrap resampling with 30 replicates per configuration, and validated interpretability through formal human evaluation by three domain experts. Our analysis demonstrates substantial sensitivity to hyperparameter choices and reveals that metric selection for stability assessment must align with analytical goals. A hierarchical merging strategy (extracting fine-grained topics for stability then consolidating for interpretability) effectively navigates the stability-coherence tradeoff, achieving coherence of 0.558 compared to 0.539 for direct extraction. Human validation confirmed topic quality with very good inter-rater reliability (ICC = 0.79, weighted Cohen's kappa = 0.578). Our framework provides practical guidelines that researchers can adapt to their own qualitative research contexts. All code, data processing scripts, and evaluation protocols are publicly available to support reproduction and extension of this work.
☆ Federated style aware transformer aggregation of representations
Personalized Federated Learning (PFL) faces persistent challenges, including domain heterogeneity from diverse client data, data imbalance due to skewed participation, and strict communication constraints. Traditional federated learning often lacks personalization, as a single global model cannot capture client-specific characteristics, leading to biased predictions and poor generalization, especially for clients with highly divergent data distributions. To address these issues, we propose FedSTAR, a style-aware federated learning framework that disentangles client-specific style factors from shared content representations. FedSTAR aggregates class-wise prototypes using a Transformer-based attention mechanism, allowing the server to adaptively weight client contributions while preserving personalization. Furthermore, by exchanging compact prototypes and style vectors instead of full model parameters, FedSTAR significantly reduces communication overhead. Experimental results demonstrate that combining content-style disentanglement with attention-driven prototype aggregation improves personalization and robustness in heterogeneous environments without increasing communication cost.
☆ Enhancing Multi-Label Thoracic Disease Diagnosis with Deep Ensemble-Based Uncertainty Quantification
The utility of deep learning models, such as CheXNet, in high stakes clinical settings is fundamentally constrained by their purely deterministic nature, failing to provide reliable measures of predictive confidence. This project addresses this critical gap by integrating robust Uncertainty Quantification (UQ) into a high performance diagnostic platform for 14 common thoracic diseases on the NIH ChestX-ray14 dataset. Initial architectural development failed to stabilize performance and calibration using Monte Carlo Dropout (MCD), yielding an unacceptable Expected Calibration Error (ECE) of 0.7588. This technical failure necessitated a rigorous architectural pivot to a high diversity, 9-member Deep Ensemble (DE). This resulting DE successfully stabilized performance and delivered superior reliability, achieving a State-of-the-Art (SOTA) average Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.8559 and an average F1 Score of 0.3857. Crucially, the DE demonstrated superior calibration (Mean ECE of 0.0728 and Negative Log-Likelihood (NLL) of 0.1916) and enabled the reliable decomposition of total uncertainty into its Aleatoric (irreducible data noise) and Epistemic (reducible model knowledge) components, with a mean Epistemic Uncertainty (EU) of 0.0240. These results establish the Deep Ensemble as a trustworthy and explainable platform, transforming the model from a probabilistic tool into a reliable clinical decision support system.
☆ Auto-ML Graph Neural Network Hypermodels for Outcome Prediction in Event-Sequence Data
This paper introduces HGNN(O), an AutoML GNN hypermodel framework for outcome prediction on event-sequence data. Building on our earlier work on graph convolutional network hypermodels, HGNN(O) extends four architectures-One Level, Two Level, Two Level Pseudo Embedding, and Two Level Embedding-across six canonical GNN operators. A self-tuning mechanism based on Bayesian optimization with pruning and early stopping enables efficient adaptation over architectures and hyperparameters without manual configuration. Empirical evaluation on both balanced and imbalanced event logs shows that HGNN(O) achieves accuracy exceeding 0.98 on the Traffic Fines dataset and weighted F1 scores up to 0.86 on the Patients dataset without explicit imbalance handling. These results demonstrate that the proposed AutoML-GNN approach provides a robust and generalizable benchmark for outcome prediction in complex event-sequence data.
comment: 6 pages
☆ Leveraging Duration Pseudo-Embeddings in Multilevel LSTM and GCN Hypermodels for Outcome-Oriented PPM
Existing deep learning models for Predictive Process Monitoring (PPM) struggle with temporal irregularities, particularly stochastic event durations and overlapping timestamps, limiting their adaptability across heterogeneous datasets. We propose a dual input neural network strategy that separates event and sequence attributes, using a duration-aware pseudo-embedding matrix to transform temporal importance into compact, learnable representations. This design is implemented across two baseline families: B-LSTM and B-GCN, and their duration-aware variants D-LSTM and D-GCN. All models incorporate self-tuned hypermodels for adaptive architecture selection. Experiments on balanced and imbalanced outcome prediction tasks show that duration pseudo-embedding inputs consistently improve generalization, reduce model complexity, and enhance interpretability. Our results demonstrate the benefits of explicit temporal encoding and provide a flexible design for robust, real-world PPM applications.
comment: 12 pages
☆ Towards Characterizing Knowledge Distillation of PPG Heart Rate Estimation Models NeurIPS 2025
Heart rate estimation from photoplethysmography (PPG) signals generated by wearable devices such as smartwatches and fitness trackers has significant implications for the health and well-being of individuals. Although prior work has demonstrated deep learning models with strong performance in the heart rate estimation task, in order to deploy these models on wearable devices, these models must also adhere to strict memory and latency constraints. In this work, we explore and characterize how large pre-trained PPG models may be distilled to smaller models appropriate for real-time inference on the edge. We evaluate four distillation strategies through comprehensive sweeps of teacher and student model capacities: (1) hard distillation, (2) soft distillation, (3) decoupled knowledge distillation (DKD), and (4) feature distillation. We present a characterization of the resulting scaling laws describing the relationship between model size and performance. This early investigation lays the groundwork for practical and predictable methods for building edge-deployable models for physiological sensing.
comment: To be published in: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Learning from Time Series for Health
☆ Solving a Research Problem in Mathematical Statistics with AI Assistance
Over the last few months, AI models including large language models have improved greatly. There are now several documented examples where they have helped professional mathematical scientists prove new results, sometimes even helping resolve known open problems. In this short note, we add another example to the list, by documenting how we were able to solve a previously unsolved research problem in robust mathematical statistics with crucial help from GPT-5. Our problem concerns robust density estimation, where the observations are perturbed by Wasserstein-bounded contaminations.In a previous preprint (Chao and Dobriban, 2023, arxiv:2308.01853v2), we have obtained upper and lower bounds on the minimax optimal estimation error; which were, however, not sharp. Starting in October 2025, making significant use of GPT-5 Pro, we were able to derive the minimax optimal error rate (reported in version 3 of the above arxiv preprint). GPT-5 provided crucial help along the way, including by suggesting calculations that we did not think of, and techniques that were not familiar to us, such as the dynamic Benamou-Brenier formulation, for key steps in the analysis. Working with GPT-5 took a few weeks of effort, and we estimate that it could have taken several months to get the same results otherwise. At the same time, there are still areas where working with GPT-5 was challenging: it sometimes provided incorrect references, and glossed over details that sometimes took days of work to fill in. We outline our workflow and steps taken to mitigate issues. Overall, our work can serve as additional documentation for a new age of human-AI collaborative work in mathematical science.
☆ Uncertainty-Aware Dual-Student Knowledge Distillation for Efficient Image Classification
Knowledge distillation has emerged as a powerful technique for model compression, enabling the transfer of knowledge from large teacher networks to compact student models. However, traditional knowledge distillation methods treat all teacher predictions equally, regardless of the teacher's confidence in those predictions. This paper proposes an uncertainty-aware dual-student knowledge distillation framework that leverages teacher prediction uncertainty to selectively guide student learning. We introduce a peer-learning mechanism where two heterogeneous student architectures, specifically ResNet-18 and MobileNetV2, learn collaboratively from both the teacher network and each other. Experimental results on ImageNet-100 demonstrate that our approach achieves superior performance compared to baseline knowledge distillation methods, with ResNet-18 achieving 83.84\% top-1 accuracy and MobileNetV2 achieving 81.46\% top-1 accuracy, representing improvements of 2.04\% and 0.92\% respectively over traditional single-student distillation approaches.
☆ Solution of Incompressible Flow Equations with Physics and Equality Constrained Artificial Neural Networks
We present a meshless method for the solution of incompressible Navier-Stokes equations in advection-dominated regimes using physics- and equality-constrained artificial neural networks combined with a conditionally adaptive augmented Lagrangian formulation. A single neural network parameterizes both the velocity and pressure fields, and is trained by minimizing the residual of a Poisson's equation for pressure, constrained by the momentum and continuity equations, together with boundary conditions on the velocity field. No boundary conditions are imposed on the pressure field aside from anchoring the pressure at a point to prevent its unbounded development. The training is performed from scratch without labeled data, relying solely on the governing equations and constraints. To enhance accuracy in advection-dominated flows, we employ a single Fourier feature mapping of the input coordinates. The proposed method is demonstrated for the canonical lid-driven cavity flow up to a Reynolds number of 7,500 and for laminar flow over a circular cylinder with inflow-outflow boundary conditions, achieving excellent agreement with benchmark solutions. We further compare the present formulation against alternative objective-function constructions based on different arrangements of the flow equations, thereby highlighting the algorithmic advantages of the proposed formulation centered around the Poisson's equation for pressure.
comment: 21 pages, 13 figures
♻ ☆ Cost-Aware Contrastive Routing for LLMs
We study cost-aware routing for large language models across diverse and dynamic pools of models. Existing approaches often overlook prompt-specific context, rely on expensive model profiling, assume a fixed set of experts, or use inefficient trial-and-error strategies. We introduce Cost-Spectrum Contrastive Routing (CSCR), a lightweight framework that maps both prompts and models into a shared embedding space to enable fast, cost-sensitive selection. CSCR uses compact, fast-to-compute logit footprints for open-source models and perplexity fingerprints for black-box APIs. A contrastive encoder is trained to favor the cheapest accurate expert within adaptive cost bands. At inference time, routing reduces to a single k-NN lookup via a FAISS index, requiring no retraining when the expert pool changes and enabling microsecond latency. Across multiple benchmarks, CSCR consistently outperforms baselines, improving the accuracy-cost tradeoff by up to 25%, while generalizing robustly to unseen LLMs and out-of-distribution prompts.
♻ ☆ Collapsing Taylor Mode Automatic Differentiation NeurIPS 2025
Computing partial differential equation (PDE) operators via nested backpropagation is expensive, yet popular, and severely restricts their utility for scientific machine learning. Recent advances, like the forward Laplacian and randomizing Taylor mode automatic differentiation (AD), propose forward schemes to address this. We introduce an optimization technique for Taylor mode that 'collapses' derivatives by rewriting the computational graph, and demonstrate how to apply it to general linear PDE operators, and randomized Taylor mode. The modifications simply require propagating a sum up the computational graph, which could -- or should -- be done by a machine learning compiler, without exposing complexity to users. We implement our collapsing procedure and evaluate it on popular PDE operators, confirming it accelerates Taylor mode and outperforms nested backpropagation.
comment: 10 pages + appendix; camera-ready version (NeurIPS 2025)
♻ ☆ SING: SDE Inference via Natural Gradients NeurIPS
Latent stochastic differential equation (SDE) models are important tools for the unsupervised discovery of dynamical systems from data, with applications ranging from engineering to neuroscience. In these complex domains, exact posterior inference of the latent state path is typically intractable, motivating the use of approximate methods such as variational inference (VI). However, existing VI methods for inference in latent SDEs often suffer from slow convergence and numerical instability. We propose SDE Inference via Natural Gradients (SING), a method that leverages natural gradient VI to efficiently exploit the underlying geometry of the model and variational posterior. SING enables fast and reliable inference in latent SDE models by approximating intractable integrals and parallelizing computations in time. We provide theoretical guarantees that SING approximately optimizes the intractable, continuous-time objective of interest. Moreover, we demonstrate that better state inference enables more accurate estimation of nonlinear drift functions using, for example, Gaussian process SDE models. SING outperforms prior methods in state inference and drift estimation on a variety of datasets, including a challenging application to modeling neural dynamics in freely behaving animals. Altogether, our results illustrate the potential of SING as a tool for accurate inference in complex dynamical systems, especially those characterized by limited prior knowledge and non-conjugate structure.
comment: To appear in Advances in Neural Processing Information Systems (NeurIPS), 2025
♻ ☆ MiniF2F in Rocq: Automatic Translation Between Proof Assistants -- A Case Study
In this work, we conduct an experiment using state-of-the-art LLMs to translate MiniF2F into Rocq. The translation task focuses on generating a Rocq theorem based on three sources: a natural language description, the Lean formalization, and the Isabelle formalization. We conducted our experiment in 3 stages of increasing complexity, from basic one-shot prompting to multi-turn conversations that incorporate feedback from unsuccessful attempts. At each stage, we perform multiple rounds of translation using increasingly advanced models: GPT-4o mini, Claude 3.5 Sonnet, o1 mini, and o1. We successfully translated 478 out of 488 theorems. The dataset is opensource: https://github.com/LLM4Rocq/miniF2F-rocq.
♻ ☆ Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models NeurIPS 2025
Robust coordination is critical for effective decision-making in multi-agent systems, especially under partial observability. A central question in Multi-Agent Reinforcement Learning (MARL) is whether to engineer communication protocols or learn them end-to-end. We investigate this dichotomy using embodied world models. We propose and compare two communication strategies for a cooperative task-allocation problem. The first, Learned Direct Communication (LDC), learns a protocol end-to-end. The second, Intention Communication, uses an engineered inductive bias: a compact, learned world model, the Imagined Trajectory Generation Module (ITGM), which uses the agent's own policy to simulate future states. A Message Generation Network (MGN) then compresses this plan into a message. We evaluate these approaches on goal-directed interaction in a grid world, a canonical abstraction for embodied AI problems, while scaling environmental complexity. Our experiments reveal that while emergent communication is viable in simple settings, the engineered, world model-based approach shows superior performance, sample efficiency, and scalability as complexity increases. These findings advocate for integrating structured, predictive models into MARL agents to enable active, goal-driven coordination.
comment: Published in the Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Scaling Environments for Agents (SEA). Additionally accepted for presentation in the NeurIPS 2025 Workshop: Embodied World Models for Decision Making (EWM) and the NeurIPS 2025 Workshop: Optimization for Machine Learning (OPT)
♻ ☆ Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics
Learning robust and generalizable world models is crucial for enabling efficient and scalable robotic control in real-world environments. In this work, we introduce a novel framework for learning world models that accurately capture complex, partially observable, and stochastic dynamics. The proposed method employs a dual-autoregressive mechanism and self-supervised training to achieve reliable long-horizon predictions without relying on domain-specific inductive biases, ensuring adaptability across diverse robotic tasks. We further propose a policy optimization framework that leverages world models for efficient training in imagined environments and seamless deployment in real-world systems. This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.
♻ ☆ PEANuT: Parameter-Efficient Adaptation with Weight-aware Neural Tweakers
Fine-tuning large pre-trained foundation models often yields excellent downstream performance but is prohibitively expensive when updating all parameters. Parameter-efficient fine-tuning (PEFT) methods such as LoRA alleviate this by introducing lightweight update modules, yet they commonly rely on weight-agnostic linear approximations, limiting their expressiveness. In this work, we propose PEANuT, a novel PEFT framework that introduces weight-aware neural tweakers, compact neural modules that generate task-adaptive updates conditioned on frozen pre-trained weights. PEANuT provides a flexible yet efficient way to capture complex update patterns without full model tuning. We theoretically show that PEANuT achieves equivalent or greater expressivity than existing linear PEFT methods with comparable or fewer parameters. Extensive experiments across four benchmarks with over twenty datasets demonstrate that PEANuT consistently outperforms strong baselines in both NLP and vision tasks, while maintaining low computational overhead.
♻ ☆ Node Preservation and its Effect on Crossover in Cartesian Genetic Programming
While crossover is a critical and often indispensable component in other forms of Genetic Programming, such as Linear- and Tree-based, it has consistently been claimed that it deteriorates search performance in CGP. As a result, a mutation-alone $(1+λ)$ evolutionary strategy has become the canonical approach for CGP. Although several operators have been developed that demonstrate an increased performance over the canonical method, a general solution to the problem is still lacking. In this paper, we compare basic crossover methods, namely one-point and uniform, to variants in which nodes are ``preserved,'' including the subgraph crossover developed by Roman Kalkreuth, the difference being that when ``node preservation'' is active, crossover is not allowed to break apart instructions. We also compare a node mutation operator to the traditional point mutation; the former simply replaces an entire node with a new one. We find that node preservation in both mutation and crossover improves search using symbolic regression benchmark problems, moving the field towards a general solution to CGP crossover.
comment: Draft to cite in another paper before both papers are peer-reviewed for the evo*2026 conference, 21 pages, 5 figures
♻ ☆ Random Spiking Neural Networks are Stable and Spectrally Simple
Spiking neural networks (SNNs) are a promising paradigm for energy-efficient computation, yet their theoretical foundations-especially regarding stability and robustness-remain limited compared to artificial neural networks. In this work, we study discrete-time leaky integrate-and-fire (LIF) SNNs through the lens of Boolean function analysis. We focus on noise sensitivity and stability in classification tasks, quantifying how input perturbations affect outputs. Our main result shows that wide LIF-SNN classifiers are stable on average, a property explained by the concentration of their Fourier spectrum on low-frequency components. Motivated by this, we introduce the notion of spectral simplicity, which formalizes simplicity in terms of Fourier spectrum concentration and connects our analysis to the simplicity bias observed in deep networks. Within this framework, we show that random LIF-SNNs are biased toward simple functions. Experiments on trained networks confirm that these stability properties persist in practice. Together, these results provide new insights into the stability and robustness properties of SNNs.
♻ ☆ Enhancing Domain-Specific Encoder Models with LLM-Generated Data: How to Leverage Ontologies, and How to Do Without Them EMNLP 2025
We investigate the use of LLM-generated data for continual pretraining of encoder models in specialized domains with limited training data, using the scientific domain of invasion biology as a case study. To this end, we leverage domain-specific ontologies by enriching them with LLM-generated data and pretraining the encoder model as an ontology-informed embedding model for concept definitions. To evaluate the effectiveness of this method, we compile a benchmark specifically designed for assessing model performance in invasion biology. After demonstrating substantial improvements over standard LLM pretraining, we investigate the feasibility of applying the proposed approach to domains without comprehensive ontologies by substituting ontological concepts with concepts automatically extracted from a small corpus of scientific abstracts and establishing relationships between concepts through distributional statistics. Our results demonstrate that this automated approach achieves comparable performance using only a small set of scientific abstracts, resulting in a fully automated pipeline for enhancing domain-specific understanding of small encoder models that is especially suited for application in low-resource settings and achieves performance comparable to masked language modeling pretraining on much larger datasets.
comment: Published in the Findings of the Association for Computational Linguistics: EMNLP 2025
♻ ☆ Interpreting Graph Inference with Skyline Explanations ICDE 2026
Inference queries have been routinely issued to graph machine learning models such as graph neural networks (GNNs) for various network analytical tasks. Nevertheless, GNN outputs are often hard to interpret comprehensively. Existing methods typically conform to individual pre-defined explainability measures (such as fidelity), which often leads to biased, ``one-side'' interpretations. This paper introduces skyline explanation, a new paradigm that interprets GNN outputs by simultaneously optimizing multiple explainability measures of users' interests. (1) We propose skyline explanations as a Pareto set of explanatory subgraphs that dominate others over multiple explanatory measures. We formulate skyline explanation as a multi-criteria optimization problem, and establish its hardness results. (2) We design efficient algorithms with an onion-peeling approach, which strategically prioritizes nodes and removes unpromising edges to incrementally assemble skyline explanations. (3) We also develop an algorithm to diversify the skyline explanations to enrich the comprehensive interpretation. (4) We introduce efficient parallel algorithms with load-balancing strategies to scale skyline explanation for large-scale GNN-based inference. Using real-world and synthetic graphs, we experimentally verify our algorithms' effectiveness and scalability.
comment: Accepted at ICDE 2026
♻ ☆ When do World Models Successfully Learn Dynamical Systems?
In this work, we explore the use of compact latent representations with learned time dynamics ('World Models') to simulate physical systems. Drawing on concepts from control theory, we propose a theoretical framework that explains why projecting time slices into a low-dimensional space and then concatenating to form a history ('Tokenization') is so effective at learning physics datasets, and characterise when exactly the underlying dynamics admit a reconstruction mapping from the history of previous tokenized frames to the next. To validate these claims, we develop a sequence of models with increasing complexity, starting with least-squares regression and progressing through simple linear layers, shallow adversarial learners, and ultimately full-scale generative adversarial networks (GANs). We evaluate these models on a variety of datasets, including modified forms of the heat and wave equations, the chaotic regime 2D Kuramoto-Sivashinsky equation, and a challenging computational fluid dynamics (CFD) dataset of a 2D Kármán vortex street around a fixed cylinder, where our model is successfully able to recreate the flow.
♻ ☆ Entropic Time Schedulers for Generative Diffusion Models
The practical performance of generative diffusion models depends on the appropriate choice of the noise scheduling function, which can also be equivalently expressed as a time reparameterization. In this paper, we present a time scheduler that selects sampling points based on entropy rather than uniform time spacing, ensuring that each point contributes an equal amount of information to the final generation. We prove that this time reparameterization does not depend on the initial choice of time. Furthermore, we provide a tractable exact formula to estimate this \emph{entropic time} for a trained model using the training loss without substantial overhead. Alongside the entropic time, inspired by the optimality results, we introduce a rescaled entropic time. In our experiments with mixtures of Gaussian distributions and ImageNet, we show that using the (rescaled) entropic times greatly improves the inference performance of trained models. In particular, we found that the image quality in pretrained EDM2 models, as evaluated by FID and FD-DINO scores, can be substantially increased by the rescaled entropic time reparameterization without increasing the number of function evaluations, with greater improvements in the few NFEs regime. Code is available at https://github.com/DejanStancevic/Entropic-Time-Schedulers-for-Generative-Diffusion-Models.
comment: 31 pages
♻ ☆ The Geometry of Cortical Computation: Manifold Disentanglement and Predictive Dynamics in VCNet NeurIPS 2025
Despite their success, modern convolutional neural networks (CNNs) exhibit fundamental limitations, including data inefficiency, poor out-of-distribution generalization, and vulnerability to adversarial perturbations. These shortcomings can be traced to a lack of inductive biases that reflect the inherent geometric structure of the visual world. The primate visual system, in contrast, demonstrates superior efficiency and robustness, suggesting that its architectural and computational principles,which evolved to internalize these structures,may offer a blueprint for more capable artificial vision. This paper introduces Visual Cortex Network (VCNet), a novel neural network architecture whose design is informed by the macro-scale organization of the primate visual cortex. VCNet is framed as a geometric framework that emulates key biological mechanisms, including hierarchical processing across distinct cortical areas, dual-stream information segregation for learning disentangled representations, and top-down predictive feedback for representation refinement. We interpret these mechanisms through the lens of geometry and dynamical systems, positing that they guide the learning of structured, low-dimensional neural manifolds. We evaluate VCNet on two specialized benchmarks: the Spots-10 animal pattern dataset, which probes sensitivity to natural textures, and a light field image classification task, which requires processing higher-dimensional visual data. Our results show that VCNet achieves state-of-the-art accuracy of 92.1\% on Spots-10 and 74.4\% on the light field dataset, surpassing contemporary models of comparable size. This work demonstrates that integrating high-level neuroscientific principles, viewed through a geometric lens, can lead to more efficient and robust models, providing a promising direction for addressing long-standing challenges in machine learning.
comment: Published in the proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Symmetry and Geometry in Neural Representations (NeurReps). Additionally accepted for presentation in NeurIPS 2025 Workshop: Interpreting Cognition in Deep Learning Models (CogInterp)
♻ ☆ Learning Protein-Ligand Binding in Hyperbolic Space
Protein-ligand binding prediction is central to virtual screening and affinity ranking, two fundamental tasks in drug discovery. While recent retrieval-based methods embed ligands and protein pockets into Euclidean space for similarity-based search, the geometry of Euclidean embeddings often fails to capture the hierarchical structure and fine-grained affinity variations intrinsic to molecular interactions. In this work, we propose HypSeek, a hyperbolic representation learning framework that embeds ligands, protein pockets, and sequences into Lorentz-model hyperbolic space. By leveraging the exponential geometry and negative curvature of hyperbolic space, HypSeek enables expressive, affinity-sensitive embeddings that can effectively model both global activity and subtle functional differences-particularly in challenging cases such as activity cliffs, where structurally similar ligands exhibit large affinity gaps. Our mode unifies virtual screening and affinity ranking in a single framework, introducing a protein-guided three-tower architecture to enhance representational structure. HypSeek improves early enrichment in virtual screening on DUD-E from 42.63 to 51.44 (+20.7%) and affinity ranking correlation on JACS from 0.5774 to 0.7239 (+25.4%), demonstrating the benefits of hyperbolic geometry across both tasks and highlighting its potential as a powerful inductive bias for protein-ligand modeling.
♻ ☆ A Bayesian Model for Multi-stage Censoring ML4H 2025
Many sequential decision settings in healthcare feature funnel structures characterized by a series of stages, such as screenings or evaluations, where the number of patients who advance to each stage progressively decreases and decisions become increasingly costly. For example, an oncologist may first conduct a breast exam, followed by a mammogram for patients with concerning exams, followed by a biopsy for patients with concerning mammograms. A key challenge is that the ground truth outcome, such as the biopsy result, is only revealed at the end of this funnel. The selective censoring of the ground truth can introduce statistical biases in risk estimation, especially in underserved patient groups, whose outcomes are more frequently censored. We develop a Bayesian model for funnel decision structures, drawing from prior work on selective labels and censoring. We first show in synthetic settings that our model is able to recover the true parameters and predict outcomes for censored patients more accurately than baselines. We then apply our model to a dataset of emergency department visits, where in-hospital mortality is observed only for those who are admitted to either the hospital or ICU. We find that there are gender-based differences in hospital and ICU admissions. In particular, our model estimates that the mortality risk threshold to admit women to the ICU is higher for women (5.1%) than for men (4.5%).
comment: Proceedings of ML4H 2025
♻ ☆ FOCUS: Efficient Keyframe Selection for Long Video Understanding
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs. Code is available at https://github.com/NUS-HPC-AI-Lab/FOCUS.
♻ ☆ WorldLLM: Improving LLMs' world modeling using curiosity-driven theory-making
Large Language Models (LLMs) possess general world knowledge but often struggle to generate precise predictions in structured, domain-specific contexts such as simulations. These limitations arise from their inability to ground their broad, unstructured understanding in specific environments. To address this, we present WorldLLM, a framework that enhances LLM-based world modeling by combining Bayesian inference and autonomous active exploration with reinforcement learning. WorldLLM leverages the in-context learning abilities of LLMs to guide an LLM-based world model's predictions using natural language hypotheses given in its prompt. These hypotheses are iteratively refined through a Bayesian inference framework that leverages a second LLM as the proposal distribution given collected evidence. This evidence is collected using a curiosity-driven reinforcement learning policy that explores the environment to find transitions with a low log-likelihood under our LLM-based predictive model using the current hypotheses. By alternating between refining hypotheses and collecting new evidence, our framework autonomously drives continual improvement of the predictions. Our experiments demonstrate the effectiveness of WorldLLM in a textual game environment that requires agents to manipulate and combine objects. The framework not only enhances predictive accuracy, but also generates human-interpretable theories of environment dynamics.
♻ ☆ Fairness in Multi-modal Medical Diagnosis with Demonstration Selection
Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.
comment: 10 pages (including 2 pages of references), 4 figures. This work explores fairness in multi-modal medical image reasoning using in-context learning
♻ ☆ Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?
Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gödel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that LIVE-SWE-AGENT can achieve an impressive solve rate of 77.4% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
♻ ☆ Higher-Order Regularization Learning on Hypergraphs
Higher-Order Hypergraph Learning (HOHL) was recently introduced as a principled alternative to classical hypergraph regularization, enforcing higher-order smoothness via powers of multiscale Laplacians induced by the hypergraph structure. Prior work established the well- and ill-posedness of HOHL through an asymptotic consistency analysis in geometric settings. We extend this theoretical foundation by proving the consistency of a truncated version of HOHL and deriving explicit convergence rates when HOHL is used as a regularizer in fully supervised learning. We further demonstrate its strong empirical performance in active learning and in datasets lacking an underlying geometric structure, highlighting HOHL's versatility and robustness across diverse learning settings.
♻ ☆ Synthetic Counterfactual Labels for Efficient Conformal Counterfactual Inference
This work addresses the problem of constructing reliable prediction intervals for individual counterfactual outcomes. Existing conformal counterfactual inference (CCI) methods provide marginal coverage guarantees but often produce overly conservative intervals, particularly under treatment imbalance when counterfactual samples are scarce. We introduce synthetic data-powered CCI (SP-CCI), a new framework that augments the calibration set with synthetic counterfactual labels generated by a pre-trained counterfactual model. To ensure validity, SP-CCI incorporates synthetic samples into a conformal calibration procedure based on risk-controlling prediction sets (RCPS) with a debiasing step informed by prediction-powered inference (PPI). We prove that SP-CCI achieves tighter prediction intervals while preserving marginal coverage, with theoretical guarantees under both exact and approximate importance weighting. Empirical results on different datasets confirm that SP-CCI consistently reduces interval width compared to standard CCI across all settings.
♻ ☆ Analysis of Semi-Supervised Learning on Hypergraphs
Hypergraphs provide a natural framework for modeling higher-order interactions, yet their theoretical underpinnings in semi-supervised learning remain limited. We provide an asymptotic consistency analysis of variational learning on random geometric hypergraphs, precisely characterizing the conditions ensuring the well-posedness of hypergraph learning as well as showing convergence to a weighted $p$-Laplacian equation. Motivated by this, we propose Higher-Order Hypergraph Learning (HOHL), which regularizes via powers of Laplacians from skeleton graphs for multiscale smoothness. HOHL converges to a higher-order Sobolev seminorm. Empirically, it performs strongly on standard baselines.
♻ ☆ Layer-wise Weight Selection for Power-Efficient Neural Network Acceleration
Systolic array accelerators execute CNNs with energy dominated by the switching activity of multiply accumulate (MAC) units. Although prior work exploits weight dependent MAC power for compression, existing methods often use global activation models, coarse energy proxies, or layer-agnostic policies, which limits their effectiveness on real hardware. We propose an energy aware, layer-wise compression framework that explicitly leverages MAC and layer level energy characteristics. First, we build a layer-aware MAC energy model that combines per-layer activation statistics with an MSB-Hamming distance grouping of 22-bit partial sum transitions, and integrate it with a tile-level systolic mapping to estimate convolution-layer energy. On top of this model, we introduce an energy accuracy co-optimized weight selection algorithm within quantization aware training and an energy-prioritized layer-wise schedule that compresses high energy layers more aggressively under a global accuracy constraint. Experiments on different CNN models demonstrate up to 58.6\% energy reduction with 2-3\% accuracy drop, outperforming a state-of-the-art power-aware baseline.
♻ ☆ Don't Reach for the Stars: Rethinking Topology for Resilient Federated Learning
Federated learning (FL) enables collaborative model training across distributed clients while preserving data privacy by keeping data local. Traditional FL approaches rely on a centralized, star-shaped topology, where a central server aggregates model updates from clients. However, this architecture introduces several limitations, including a single point of failure, limited personalization, and poor robustness to distribution shifts or vulnerability to malfunctioning clients. Moreover, update selection in centralized FL often relies on low-level parameter differences, which can be unreliable when client data is not independent and identically distributed, and offer clients little control. In this work, we propose a decentralized, peer-to-peer (P2P) FL framework. It leverages the flexibility of the P2P topology to enable each client to identify and aggregate a personalized set of trustworthy and beneficial updates.This framework is the Local Inference Guided Aggregation for Heterogeneous Training Environments to Yield Enhancement Through Agreement and Regularization (LIGHTYEAR). Central to our method is an agreement score, computed on a local validation set, which quantifies the semantic alignment of incoming updates in the function space with respect to the clients reference model. Each client uses this score to select a tailored subset of updates and performs aggregation with a regularization term that further stabilizes the training. Our empirical evaluation across five datasets shows that the proposed approach consistently outperforms both, centralized baselines and existing P2P methods in terms of client-level performance, particularly under adversarial and heterogeneous conditions.
♻ ☆ In-Situ Tweedie Discrete Diffusion Models
While diffusion models excel at generating continuous data such as images, adapting them to discrete tasks has relied on indirect approaches that either operate in continuous embedding spaces or use token masking mechanisms, both of which deviate from modeling the true discrete data distribution that can be theoretically guaranteed by Tweedie's formula. We propose in-situ Tweedie Discrete Diffusion (TDD), a framework that performs diffusion guaranteed by Tweedie's formula directly within the discrete one-hot space, hence "in-situ." Unlike prior methods that diffuse continuous embeddings or mask tokens, TDD directly corrupts one-hot vectors with Gaussian noise and performs iterative denoising through a timestep-conditioned cross-entropy objective rather than mean-squared-error reconstruction. At each denoising step, the model predicts class probabilities, applies argmax to obtain discrete predictions, converts them to one-hot vectors, and feeds them into the next iteration with progressively reduced noise. This process naturally unifies discriminative classification and generative modeling under a single framework. Experiments demonstrate that TDD achieves strong performance on both image classification and text generation tasks, with extensive ablation studies confirming the effectiveness of each design component. Our work establishes a principled approach to discrete diffusion that preserves the core characteristics of diffusion models while operating natively in discrete space.
♻ ☆ A Goemans-Williamson type algorithm for identifying subcohorts in clinical trials
We design an efficient algorithm that outputs tests for identifying predominantly homogeneous subcohorts of patients from large in-homogeneous datasets. Our theoretical contribution is a rounding technique, similar to that of Goemans and Wiliamson (1995), that approximates the optimal solution within a factor of $0.82$. As an application, we use our algorithm to trade-off sensitivity for specificity to systematically identify clinically interesting homogeneous subcohorts of patients in the RNA microarray dataset for breast cancer from Curtis et al. (2012). One such clinically interesting subcohort suggests a link between LXR over-expression and BRCA2 and MSH6 methylation levels for patients in that subcohort.
♻ ☆ Principled Coarse-Grained Acceptance for Speculative Decoding in Speech
Speculative decoding accelerates autoregressive speech generation by letting a fast draft model propose tokens that a larger target model verifies. However, for speech LLMs that generate acoustic tokens, exact token matching is overly restrictive: many discrete tokens are acoustically or semantically interchangeable, reducing acceptance rates and limiting speedups. We introduce Principled Coarse-Graining (PCG), which verifies proposals at the level of Acoustic Similarity Groups (ASGs) derived from the target model's embedding space. By splitting each token's probability mass across the overlapping groups that contain it, we define an overlap-aware coarse-grained distribution and perform rejection sampling on the resulting group variable. This yields an exactness guarantee at the group level while allowing the accepted draft token to stand in for any member of the group in practice. On LibriTTS, PCG increases acceptance and throughput relative to standard speculative decoding and prior speech-specific relaxations while maintaining intelligibility and speaker similarity. These results suggest acoustically aware, group-level acceptance as a simple and general way to accelerate speech token generation while maintaining speech quality.
♻ ☆ Neural Scaling Laws for Deep Regression
Neural scaling laws--power-law relationships between generalization errors and characteristics of deep learning models--are vital tools for developing reliable models while managing limited resources. Although the success of large language models highlights the importance of these laws, their application to deep regression models remains largely unexplored. Here, we empirically investigate neural scaling laws in deep regression using a parameter estimation model for twisted van der Waals magnets. We observe power-law relationships between the loss and both training dataset size and model capacity across a wide range of values, employing various architectures--including fully connected networks, residual networks, and vision transformers. Furthermore, the scaling exponents governing these relationships range from 1 to 2, with specific values depending on the regressed parameters and model details. The consistent scaling behaviors and their large scaling exponents suggest that the performance of deep regression models can improve substantially with increasing data size.
comment: Supplementary Information will be provided with the published manuscript
♻ ☆ GiBy: A Giant-Step Baby-Step Classifier For Anomaly Detection In Industrial Control Systems
The continuous monitoring of the interactions between cyber-physical components of any industrial control system (ICS) is required to secure automation of the system controls, and to guarantee plant processes are fail-safe and remain in an acceptably safe state. Safety is achieved by managing actuation (where electric signals are used to trigger physical movement), dependent on corresponding sensor readings; used as ground truth in decision making. Timely detection of anomalies (attacks, faults and unascertained states) in ICSs is crucial for the safe running of a plant, the safety of its personnel, and for the safe provision of any services provided. We propose an anomaly detection method that involves accurate linearization of the non-linear forms arising from sensor-actuator(s) relationships, primarily because solving linear models is easier and well understood. We accomplish this by using a well-known water treatment testbed as a use case. Our experiments show millisecond time response to detect anomalies, all of which are explainable and traceable; this simultaneous coupling of detection speed and explainability has not been achieved by other state of the art Artificial Intelligence (AI)/ Machine Learning (ML) models with eXplainable AI (XAI) used for the same purpose. Our methods explainability enables us to pin-point the sensor(s) and the actuation state(s) for which the anomaly was detected. The proposed algorithm showed an accuracy of 97.72% by flagging deviations within safe operation limits as non-anomalous; indicative that slower detectors with highest detection resolution is unnecessary, for systems whose safety boundaries provide leeway within safety limits.
♻ ☆ Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification NeurIPS 2025
Recent advances have significantly improved our understanding of the generalization performance of gradient descent (GD) methods in deep neural networks. A natural and fundamental question is whether GD can achieve generalization rates comparable to the minimax optimal rates established in the kernel setting. Existing results either yield suboptimal rates of $O(1/\sqrt{n})$, or focus on networks with smooth activation functions, incurring exponential dependence on network depth $L$. In this work, we establish optimal generalization rates for GD with deep ReLU networks by carefully trading off optimization and generalization errors, achieving only polynomial dependence on depth. Specifically, under the assumption that the data are NTK separable from the margin $γ$, we prove an excess risk rate of $\widetilde{O}(L^4 (1 + γL^2) / (n γ^2))$, which aligns with the optimal SVM-type rate $\widetilde{O}(1 / (n γ^2))$ up to depth-dependent factors. A key technical contribution is our novel control of activation patterns near a reference model, enabling a sharper Rademacher complexity bound for deep ReLU networks trained with gradient descent.
comment: Published in NeurIPS 2025
♻ ☆ Mathematical Insights into Protein Architecture: Persistent Homology and Machine Learning Applied to the Flagellar Motor
We present a machine learning approach that leverages persistent homology to classify bacterial flagellar motors into two functional states: rotated and stalled. By embedding protein structural data into a topological framework, we extract multiscale features from filtered simplicial complexes constructed over atomic coordinates. These topological invariants, specifically persistence diagrams and barcodes, capture critical geometric and connectivity patterns that correlate with motor function. The extracted features are vectorized and integrated into a machine learning pipeline that includes dimensionality reduction and supervised classification. Applied to a curated dataset of experimentally characterized flagellar motors from diverse bacterial species, our model demonstrates high classification accuracy and robustness to structural variation. This approach highlights the power of topological data analysis in revealing functionally relevant patterns beyond the reach of traditional geometric descriptors, offering a novel computational tool for protein function prediction.
♻ ☆ Health App Reviews for Privacy & Trust (HARPT): A Corpus for Analyzing Patient Privacy Concerns, Trust in Providers and Trust in Applications
Background: User reviews of Telehealth and Patient Portal mobile applications (apps) hereon referred to as electronic health (eHealth) apps are a rich source of unsolicited patient feedback, revealing critical insights into patient perceptions. However, the lack of large-scale, annotated datasets specific to privacy and trust has limited the ability of researchers to systematically analyze these concerns using natural language processing (NLP) techniques. Objective: This study aims to develop and benchmark Health App Reviews for Privacy & Trust (HARPT), a large-scale annotated corpus of patient reviews from eHealth apps to advance research in patient privacy and trust. Methods: We employed a multistage data construction strategy. This integrated keyword-based filtering, iterative manual labeling with review, targeted data augmentation, and weak supervision using transformer-based classifiers. A curated subset of 7,000 reviews was manually annotated to support machine learning model development and evaluation. The resulting dataset was used to benchmark a broad range of models. Results: The HARPT corpus comprises 480,000 patient reviews annotated across seven categories capturing critical aspects of trust in the application (TA), trust in the provider (TP), and privacy concerns (PC). We provide comprehensive benchmark performance for a range of machine learning models on the manually annotated subset, establishing a baseline for future research. Conclusions: The HARPT corpus is a significant resource for advancing the study of privacy and trust in the eHealth domain. By providing a large-scale, annotated dataset and initial benchmarks, this work supports reproducible research in usable privacy and trust within health informatics. HARPT is released under an open resource license.
♻ ☆ Inferring response times of perceptual decisions with Poisson variational autoencoders NeurIPS 2025
Many properties of perceptual decision making are well-modeled by deep neural networks. However, such architectures typically treat decisions as instantaneous readouts, overlooking the temporal dynamics of the decision process. We present an image-computable model of perceptual decision making in which choices and response times arise from efficient sensory encoding and Bayesian decoding of neural spiking activity. We use a Poisson variational autoencoder to learn unsupervised representations of visual stimuli in a population of rate-coded neurons, modeled as independent homogeneous Poisson processes. A task-optimized decoder then continually infers an approximate posterior over actions conditioned on incoming spiking activity. Combining these components with an entropy-based stopping rule yields a principled and image-computable model of perceptual decisions capable of generating trial-by-trial patterns of choices and response times. Applied to MNIST digit classification, the model reproduces key empirical signatures of perceptual decision making, including stochastic variability, right-skewed response time distributions, logarithmic scaling of response times with the number of alternatives (Hick's law), and speed-accuracy trade-offs.
comment: To appear at the NeurIPS 2025 Workshop on Data on the Brain \& Mind
♻ ☆ FunReason: Enhancing Large Language Models' Function Calling via Self-Refinement Multiscale Loss and Automated Data Refinement
The integration of large language models (LLMs) with function calling has emerged as a crucial capability for enhancing their practical utility in real-world applications. However, effectively combining reasoning processes with accurate function execution remains a significant challenge. Traditional training approaches often struggle to balance the detailed reasoning steps with the precision of function calls, leading to suboptimal performance. To address these limitations, we introduce FunReason, a novel framework that enhances LLMs' function calling capabilities through an automated data refinement strategy and a Self-Refinement Multiscale Loss (SRML) approach. FunReason leverages LLMs' natural reasoning abilities to generate high-quality training examples, focusing on query parseability, reasoning coherence, and function call precision. The SRML approach dynamically balances the contribution of reasoning processes and function call accuracy during training, addressing the inherent trade-off between these two critical aspects. FunReason achieves performance comparable to GPT-4o while effectively mitigating catastrophic forgetting during fine-tuning. FunReason provides a comprehensive solution for enhancing LLMs' function calling capabilities by introducing a balanced training methodology and a data refinement pipeline. For code and dataset, please refer to our repository at GitHub https://github.com/BingguangHao/FunReason
♻ ☆ The inexact power augmented Lagrangian method for constrained nonconvex optimization
This work introduces an unconventional inexact augmented Lagrangian method where the augmenting term is a Euclidean norm raised to a power between one and two. The proposed algorithm is applicable to a broad class of constrained nonconvex minimization problems that involve nonlinear equality constraints. In a first part of this work, we conduct a full complexity analysis of the method under a mild regularity condition, leveraging an accelerated first-order algorithm for solving the Hölder-smooth subproblems. Interestingly, this worst-case result indicates that using lower powers for the augmenting term leads to faster constraint satisfaction, albeit with a slower decrease of the dual residual. Notably, our analysis does not assume boundedness of the iterates. Thereafter, we present an inexact proximal point method for solving the weakly-convex and Hölder-smooth subproblems, and demonstrate that the combined scheme attains an improved rate that reduces to the best-known convergence rate whenever the augmenting term is a classical squared Euclidean norm. Different augmenting terms, involving a lower power, further improve the primal complexity at the cost of the dual complexity. Finally, numerical experiments validate the practical performance of unconventional augmenting terms.
comment: Accepted for publication in Transactions on Machine Learning Research
♻ ☆ Beyond Predictions: A Participatory Framework for Multi-Stakeholder Decision-Making
Conventional automated decision-support systems often prioritize predictive accuracy, overlooking the complexities of real-world settings where stakeholders' preferences may diverge or conflict. This can lead to outcomes that disadvantage vulnerable groups and erode trust in algorithmic processes. Participatory AI approaches aim to address these issues but remain largely context-specific, limiting their broader applicability and scalability. To address these gaps, we propose a participatory framework that reframes decision-making as a multi-stakeholder learning and optimization problem. Our modular, model-agnostic approach builds on the standard machine learning training pipeline to fine-tune user-provided prediction models and evaluate decision strategies, including compromise functions that mediate stakeholder trade-offs. A synthetic scoring mechanism aggregates user-defined preferences across multiple metrics, ranking strategies and selecting an optimal decision-maker to generate actionable recommendations that jointly optimize performance, fairness, and domain-specific goals. Empirical validation on two high-stakes case studies demonstrates the versatility of the framework and its promise as a more accountable, context-aware alternative to prediction-centric pipelines for socially impactful deployments.
♻ ☆ Node Embeddings via Neighbor Embeddings
Node embeddings are a paradigm in non-parametric graph representation learning, where graph nodes are embedded into a given vector space to enable downstream processing. State-of-the-art node-embedding algorithms, such as DeepWalk and node2vec, are based on random-walk notions of node similarity and on contrastive learning. In this work, we introduce the graph neighbor-embedding (graph NE) framework that directly pulls together embedding vectors of adjacent nodes without relying on any random walks. We show that graph NE strongly outperforms state-of-the-art node-embedding algorithms in terms of local structure preservation. Furthermore, we apply graph NE to the 2D node-embedding problem, obtaining graph t-SNE layouts that also outperform existing graph-layout algorithms.
comment: Accepted to Transactions of Machine Learning Research (TMLR)
♻ ☆ (De)-regularized Maximum Mean Discrepancy Gradient Flow
We introduce a (de)-regularization of the Maximum Mean Discrepancy (DrMMD) and its Wasserstein gradient flow. Existing gradient flows that transport samples from source distribution to target distribution with only target samples, either lack tractable numerical implementation ($f$-divergence flows) or require strong assumptions, and modifications such as noise injection, to ensure convergence (Maximum Mean Discrepancy flows). In contrast, DrMMD flow can simultaneously (i) guarantee near-global convergence for a broad class of targets in both continuous and discrete time, and (ii) be implemented in closed form using only samples. The former is achieved by leveraging the connection between the DrMMD and the $χ^2$-divergence, while the latter comes by treating DrMMD as MMD with a de-regularized kernel. Our numerical scheme uses an adaptive de-regularization schedule throughout the flow to optimally trade off between discretization errors and deviations from the $χ^2$ regime. The potential application of the DrMMD flow is demonstrated across several numerical experiments, including a large-scale setting of training student/teacher networks.
♻ ☆ Forecasting-based Biomedical Time-series Data Synthesis for Open Data and Robust AI
The limited data availability due to strict privacy regulations and significant resource demands severely constrains biomedical time-series AI development, which creates a critical gap between data requirements and accessibility. Synthetic data generation presents a promising solution by producing artificial datasets that maintain the statistical properties of real biomedical time-series data without compromising patient confidentiality. While GANs, VAEs, and diffusion models capture global data distributions, forecasting models offer inductive biases tailored for sequential dynamics. We propose a framework for synthetic biomedical time-series data generation based on recent forecasting models that accurately replicates complex electrophysiological signals such as EEG and EMG with high fidelity. These synthetic datasets can be freely shared for open AI development and consistently improve downstream model performance. Numerical results on sleep-stage classification show up to a 3.71\% performance gain with augmentation and a 91.00\% synthetic-only accuracy that surpasses the real-data-only baseline.
comment: 22 pages
♻ ☆ Differentiated Directional Intervention A Framework for Evading LLM Safety Alignment AAAI-26
Safety alignment instills in Large Language Models (LLMs) a critical capacity to refuse malicious requests. Prior works have modeled this refusal mechanism as a single linear direction in the activation space. We posit that this is an oversimplification that conflates two functionally distinct neural processes: the detection of harm and the execution of a refusal. In this work, we deconstruct this single representation into a Harm Detection Direction and a Refusal Execution Direction. Leveraging this fine-grained model, we introduce Differentiated Bi-Directional Intervention (DBDI), a new white-box framework that precisely neutralizes the safety alignment at critical layer. DBDI applies adaptive projection nullification to the refusal execution direction while suppressing the harm detection direction via direct steering. Extensive experiments demonstrate that DBDI outperforms prominent jailbreaking methods, achieving up to a 97.88\% attack success rate on models such as Llama-2. By providing a more granular and mechanistic framework, our work offers a new direction for the in-depth understanding of LLM safety alignment.
comment: AAAI-26-AIA
♻ ☆ Causally Reliable Concept Bottleneck Models NeurIPS 2025
Concept-based models are an emerging paradigm in deep learning that constrains the inference process to operate through human-interpretable variables, facilitating explainability and human interaction. However, these architectures, on par with popular opaque neural models, fail to account for the true causal mechanisms underlying the target phenomena represented in the data. This hampers their ability to support causal reasoning tasks, limits out-of-distribution generalization, and hinders the implementation of fairness constraints. To overcome these issues, we propose Causally reliable Concept Bottleneck Models (C$^2$BMs), a class of concept-based architectures that enforce reasoning through a bottleneck of concepts structured according to a model of the real-world causal mechanisms. We also introduce a pipeline to automatically learn this structure from observational data and unstructured background knowledge (e.g., scientific literature). Experimental evidence suggests that C$^2$BMs are more interpretable, causally reliable, and improve responsiveness to interventions w.r.t. standard opaque and concept-based models, while maintaining their accuracy.
comment: Accepted at NeurIPS 2025
♻ ☆ When, Where and Why to Average Weights?
Averaging checkpoints along the training trajectory is a simple yet powerful approach to improve the generalization performance of Machine Learning models and reduce training time. Motivated by these potential gains, and in an effort to fairly and thoroughly benchmark this technique, we present an extensive evaluation of averaging techniques in modern Deep Learning, which we perform using AlgoPerf \citep{dahl_benchmarking_2023}, a large-scale benchmark for optimization algorithms. We investigate whether weight averaging can reduce training time, improve generalization, and replace learning rate decay, as suggested by recent literature. Our evaluation across seven architectures and datasets reveals that averaging significantly accelerates training and yields considerable efficiency gains, at the price of a minimal implementation and memory cost, while mildly improving generalization across all considered workloads. Finally, we explore the relationship between averaging and learning rate annealing and show how to optimally combine the two to achieve the best performances.
♻ ☆ Counterfactual Explainable AI (XAI) Method for Deep Learning-Based Multivariate Time Series Classification AAAI 2026
Recent advances in deep learning have improved multivariate time series (MTS) classification and regression by capturing complex patterns, but their lack of transparency hinders decision-making. Explainable AI (XAI) methods offer partial insights, yet often fall short of conveying the full decision space. Counterfactual Explanations (CE) provide a promising alternative, but current approaches typically prioritize either accuracy, proximity or sparsity -- rarely all -- limiting their practical value. To address this, we propose CONFETTI, a novel multi-objective CE method for MTS. CONFETTI identifies key MTS subsequences, locates a counterfactual target, and optimally modifies the time series to balance prediction confidence, proximity and sparsity. This method provides actionable insights with minimal changes, improving interpretability, and decision support. CONFETTI is evaluated on seven MTS datasets from the UEA archive, demonstrating its effectiveness in various domains. CONFETTI consistently outperforms state-of-the-art CE methods in its optimization objectives, and in six other metrics from the literature, achieving $\geq10\%$ higher confidence while improving sparsity in $\geq40\%$.
comment: Accepted in AAAI 2026 Technical Main Track
♻ ☆ SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference
Mixture-of-Experts (MoE) models improve the scalability of large language models (LLMs) by activating only a small subset of relevant experts per input. However, the sheer number of expert networks in an MoE model introduces a significant storage burden for an edge device. To address this challenge, we consider a scenario where experts are dispersed across an edge network for distributed inference. Based on the popular Top-$K$ expert selection strategy, we formulate a latency minimization problem by optimizing expert caching on edge servers under storage constraints. When $K=1$, the problem reduces to a monotone submodular maximization problem with knapsack constraints, for which we design a greedy-based algorithm with a $(1 - 1/e)$-approximation guarantee. For the general case where $K \geq 1$, expert co-activation within the same MoE layer introduces non-submodularity, which renders greedy methods ineffective. To tackle this issue, we propose a successive greedy decomposition method to decompose the original problem into a series of subproblems, with each being solved by a dynamic programming approach. Furthermore, we design an accelerated algorithm based on the max-convolution technique to obtain the approximate solution with a provable guarantee in polynomial time. Simulation results on various MoE models demonstrate that our method significantly reduces inference latency compared to existing baselines.
comment: 17 pages, 11 figures. This work has been submitted to the IEEE for possible publication
♻ ☆ Interpretability of Graph Neural Networks to Assess Effects of Global Change Drivers on Ecological Networks
Pollinators play a crucial role for plant reproduction, either in natural ecosystem or in human-modified landscape. Global change drivers,including climate change or land use modifications, can alter the plant-pollinator interactions. To assess the potential influence of global change drivers on pollination, large-scale interactions, climate and land use data are required. While recent machine learning methods, such as graph neural networks (GNNs), allow the analysis of such datasets, interpreting their results can be challenging. We explore existing methods for interpreting GNNs in order to highlight the effects of various environmental covariates on pollination network connectivity. An extensive simulation study is performed to confirm whether these methods can detect the interactive effect between a covariate and a genus of plant on connectivity, and whether the application of debiasing techniques influences the estimation of these effects. An application on the Spipoll dataset, with and without accounting for sampling effects, highlights the potential impact of land use on network connectivity and shows that accounting for sampling effects partially alters the estimation of these effects.
♻ ☆ Learning Potential Energy Surfaces of Hydrogen Atom Transfer Reactions in Peptides
Hydrogen atom transfer (HAT) reactions are essential in many biological processes, such as radical migration in damaged proteins, but their mechanistic pathways remain incompletely understood. Simulating HAT is challenging due to the need for quantum chemical accuracy at biologically relevant scales; thus, neither classical force fields nor DFT-based molecular dynamics are applicable. Machine-learned potentials offer an alternative, able to learn potential energy surfaces (PESs) with near-quantum accuracy. However, training these models to generalize across diverse HAT configurations, especially at radical positions in proteins, requires tailored data generation and careful model selection. Here, we systematically generate HAT configurations in peptides to build large datasets using semiempirical methods and DFT. We benchmark three graph neural network architectures (SchNet, Allegro, and MACE) on their ability to learn HAT PESs and indirectly predict reaction barriers from energy predictions. MACE consistently outperforms the others in energy, force, and barrier prediction, achieving a mean absolute error of 1.13 kcal/mol on out-of-distribution DFT barrier predictions. Using molecular dynamics, we show our MACE potential is stable, reactive, and generalizes beyond training data to model HAT barriers in collagen I. This accuracy enables integration of ML potentials into large-scale collagen simulations to compute reaction rates from predicted barriers, advancing mechanistic understanding of HAT and radical migration in peptides. We analyze scaling laws, model transferability, and cost-performance trade-offs, and outline strategies for improvement by combining ML potentials with transition state search algorithms and active learning. Our approach is generalizable to other biomolecular systems, enabling quantum-accurate simulations of chemical reactivity in complex environments.
comment: 20 pages, 12 figures, and 4 tables (references and SI included)
♻ ☆ On the dimension of pullback attractors in recurrent neural networks
Recurrent Neural Networks (RNNs) are high-dimensional state space models capable of learning functions on sequence data. Recently, it has been conjectured that reservoir computers, a particular class of RNNs, trained on observations of a dynamical systems can be interpreted as embeddings. This result has been established for the case of linear reservoir systems. In this work, we use a nonautonomous dynamical systems approach to establish an upper bound for the fractal dimension of the subset of reservoir state space approximated during training and prediction phase. We prove that when the input sequences comes from an Nin-dimensional invertible dynamical system, the fractal dimension of this set is bounded above by Nin. The result obtained here are useful in dimensionality reduction of computation in RNNs as well as estimating fractal dimensions of dynamical systems from limited observations of their time series. It is also a step towards understanding embedding properties of reservoir computers.
comment: Issues with clarity and notation
♻ ☆ General-Purpose Models for the Chemical Sciences: LLMs and Beyond
Data-driven techniques have a large potential to transform and accelerate the chemical sciences. However, chemical sciences also pose the unique challenge of very diverse, small, fuzzy datasets that are difficult to leverage in conventional machine learning approaches. A new class of models, which can be summarized under the term general-purpose models (GPMs) such as large language models, has shown the ability to solve tasks they have not been directly trained on, and to flexibly operate with low amounts of data in different formats. In this review, we discuss fundamental building principles of GPMs and review recent and emerging applications of those models in the chemical sciences across the entire scientific process. While many of these applications are still in the prototype phase, we expect that the increasing interest in GPMs will make many of them mature in the coming years.
♻ ☆ Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving AAAI 2026
Large Language Models (LLMs) have demonstrated significant potential across various domains. However, they often struggle with integrating external knowledge and performing complex reasoning, leading to hallucinations and unreliable outputs. Retrieval Augmented Generation (RAG) has emerged as a promising paradigm to mitigate these issues by incorporating external knowledge. Yet, conventional RAG approaches, especially those based on vector similarity, fail to effectively capture relational dependencies and support multi-step reasoning. In this work, we propose CogGRAG, a human cognition-inspired, graph-based RAG framework designed for Knowledge Graph Question Answering (KGQA). CogGRAG models the reasoning process as a tree-structured mind map that decomposes the original problem into interrelated subproblems and explicitly encodes their semantic relationships. This structure not only provides a global view to guide subsequent retrieval and reasoning but also enables self-consistent verification across reasoning paths. The framework operates in three stages: (1) top-down problem decomposition via mind map construction, (2) structured retrieval of both local and global knowledge from external Knowledge Graphs (KGs), and (3) bottom-up reasoning with dual-process self-verification. Unlike previous tree-based decomposition methods such as MindMap or Graph-CoT, CogGRAG unifies problem decomposition, knowledge retrieval, and reasoning under a single graph-structured cognitive framework, allowing early integration of relational knowledge and adaptive verification. Extensive experiments demonstrate that CogGRAG achieves superior accuracy and reliability compared to existing methods.
comment: The paper has been accepted by AAAI 2026
♻ ☆ Compressing Sensor Data for Remote Assistance of Autonomous Vehicles using Deep Generative Models NeurIPS 2021
In the foreseeable future, autonomous vehicles will require human assistance in situations they can not resolve on their own. In such scenarios, remote assistance from a human can provide the required input for the vehicle to continue its operation. Typical sensors used in autonomous vehicles include camera and lidar sensors. Due to the massive volume of sensor data that must be sent in real-time, highly efficient data compression is elementary to prevent an overload of network infrastructure. Sensor data compression using deep generative neural networks has been shown to outperform traditional compression approaches for both image and lidar data, regarding compression rate as well as reconstruction quality. However, there is a lack of research about the performance of generative-neural-network-based compression algorithms for remote assistance. In order to gain insights into the feasibility of deep generative models for usage in remote assistance, we evaluate state-of-the-art algorithms regarding their applicability and identify potential weaknesses. Further, we implement an online pipeline for processing sensor data and demonstrate its performance for remote assistance using the CARLA simulator.
comment: Daniel Bogdoll, Johannes Jestram, Jonas Rauch, Christin Scheib and Moritz Wittig contributed equally. Accepted for publication at NeurIPS 2021 ML4AD Workshop
♻ ☆ Description of Corner Cases in Automated Driving: Goals and Challenges ICCV 2021
Scaling the distribution of automated vehicles requires handling various unexpected and possibly dangerous situations, termed corner cases (CC). Since many modules of automated driving systems are based on machine learning (ML), CC are an essential part of the data for their development. However, there is only a limited amount of CC data in large-scale data collections, which makes them challenging in the context of ML. With a better understanding of CC, offline applications, e.g., dataset analysis, and online methods, e.g., improved performance of automated driving systems, can be improved. While there are knowledge-based descriptions and taxonomies for CC, there is little research on machine-interpretable descriptions. In this extended abstract, we will give a brief overview of the challenges and goals of such a description.
comment: Daniel Bogdoll, Jasmin Breitenstein and Florian Heidecker contributed equally. Accepted for publication at ICCV 2021 ERCVAD Workshop
♻ ☆ BiasJailbreak:Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models AAAI 2026
Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as `jailbreaks', where malicious inputs can coerce LLMs into generating harmful content bypassing safety alignments. In this paper, we delve into the ethical biases in LLMs and examine how those biases could be exploited for jailbreaks. Notably, these biases result in a jailbreaking success rate in GPT-4o models that differs by 20\% between non-binary and cisgender keywords and by 16\% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of BiasJailbreak, highlighting the inherent risks posed by these safety-induced biases. BiasJailbreak generates biased keywords automatically by asking the target LLM itself, and utilizes the keywords to generate harmful output. Additionally, we propose an efficient defense method BiasDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. BiasDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize that ethical biases in LLMs can actually lead to generating unsafe output, and suggest a method to make the LLMs more secure and unbiased. To enable further research and improvements, we open-source our code and artifacts of BiasJailbreak, providing the community with tools to better understand and mitigate safety-induced biases in LLMs.
comment: Accepted as a workshop paper at AAAI 2026
♻ ☆ VeML: An End-to-End Machine Learning Lifecycle for Large-scale and High-dimensional Data
An end-to-end machine learning (ML) lifecycle consists of many iterative processes, from data preparation and ML model design to model training and then deploying the trained model for inference. When building an end-to-end lifecycle for an ML problem, many ML pipelines must be designed and executed that produce a huge number of lifecycle versions. Therefore, this paper introduces VeML, a Version management system dedicated to end-to-end ML Lifecycle. Our system tackles several crucial problems that other systems have not solved. First, we address the high cost of building an ML lifecycle, especially for large-scale and high-dimensional dataset. We solve this problem by proposing to transfer the lifecycle of similar datasets managed in our system to the new training data. We design an algorithm based on the core set to compute similarity for large-scale, high-dimensional data efficiently. Another critical issue is the model accuracy degradation by the difference between training data and testing data during the ML lifetime, which leads to lifecycle rebuild. Our system helps to detect this mismatch without getting labeled data from testing data and rebuild the ML lifecycle for a new data version. To demonstrate our contributions, we conduct experiments on real-world, large-scale datasets of driving images and spatiotemporal sensor data and show promising results.
comment: The updated version of this paper, titled "Efficient ML Lifecycle Transferring for Large-scale and High-dimensional Data via Core Set-based Dataset Similarity," has been accepted for publication in IEEE Access
Computer Vision and Pattern Recognition
☆ LumiTex: Towards High-Fidelity PBR Texture Generation with Illumination Context
Physically-based rendering (PBR) provides a principled standard for realistic material-lighting interactions in computer graphics. Despite recent advances in generating PBR textures, existing methods fail to address two fundamental challenges: 1) materials decomposition from image prompts under limited illumination cues, and 2) seamless and view-consistent texture completion. To this end, we propose LumiTex, an end-to-end framework that comprises three key components: (1) a multi-branch generation scheme that disentangles albedo and metallic-roughness under shared illumination priors for robust material understanding, (2) a lighting-aware material attention mechanism that injects illumination context into the decoding process for physically grounded generation of albedo, metallic, and roughness maps, and (3) a geometry-guided inpainting module based on a large view synthesis model that enriches texture coverage and ensures seamless, view-consistent UV completion. Extensive experiments demonstrate that LumiTex achieves state-of-the-art performance in texture quality, surpassing both existing open-source and commercial methods.
comment: Project page: https://lumitex.vercel.app
☆ VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection
We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
☆ Are Image-to-Video Models Good Zero-Shot Image Editors?
Large-scale video diffusion models show strong world simulation and temporal reasoning abilities, but their use as zero-shot image editors remains underexplored. We introduce IF-Edit, a tuning-free framework that repurposes pretrained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three key challenges: prompt misalignment, redundant temporal latents, and blurry late-stage frames. It includes (1) a chain-of-thought prompt enhancement module that transforms static editing instructions into temporally grounded reasoning prompts; (2) a temporal latent dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving semantic and temporal coherence; and (3) a self-consistent post-refinement step that sharpens late-stage frames using a short still-video trajectory. Experiments on four public benchmarks, covering non-rigid editing, physical and temporal reasoning, and general instruction edits, show that IF-Edit performs strongly on reasoning-centric tasks while remaining competitive on general-purpose edits. Our study provides a systematic view of video diffusion models as image editors and highlights a simple recipe for unified video-image generative reasoning.
comment: technical report
☆ Breaking the Likelihood-Quality Trade-off in Diffusion Models by Merging Pretrained Experts ICLR 2025
Diffusion models for image generation often exhibit a trade-off between perceptual sample quality and data likelihood: training objectives emphasizing high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity. We introduce a simple plug-and-play sampling method that combines two pretrained diffusion experts by switching between them along the denoising trajectory. Specifically, we apply an image-quality expert at high noise levels to shape global structure, then switch to a likelihood expert at low noise levels to refine pixel statistics. The approach requires no retraining or fine-tuning -- only the choice of an intermediate switching step. On CIFAR-10 and ImageNet32, the merged model consistently matches or outperforms its base components, improving or preserving both likelihood and sample quality relative to each expert alone. These results demonstrate that expert switching across noise levels is an effective way to break the likelihood-quality trade-off in image diffusion models.
comment: ICLR 2025 DeLTa workshop
☆ Mixture of Horizons in Action Chunking
Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying fixed choice of single horizons being suboptimal. To mitigate the trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments over flow-based policies $π_0$, $π_{0.5}$, and one-step regression policy $π_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulations and real-world tasks. Notably, under mixed-task setting, $π_{0.5}$ with MoH reaches a new state-of-the-art with 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons
comment: 15 pages, 14 figures
☆ Cloud4D NeurIPS 2025
There has been great progress in improving numerical weather prediction and climate models using machine learning. However, most global models act at a kilometer-scale, making it challenging to model individual clouds and factors such as extreme precipitation, wind gusts, turbulence, and surface irradiance. Therefore, there is a need to move towards higher-resolution models, which in turn require high-resolution real-world observations that current instruments struggle to obtain. We present Cloud4D, the first learning-based framework that reconstructs a physically consistent, four-dimensional cloud state using only synchronized ground-based cameras. Leveraging a homography-guided 2D-to-3D transformer, Cloud4D infers the full 3D distribution of liquid water content at 25 m spatial and 5 s temporal resolution. By tracking the 3D liquid water content retrievals over time, Cloud4D additionally estimates horizontal wind vectors. Across a two-month deployment comprising six skyward cameras, our system delivers an order-of-magnitude improvement in space-time resolution relative to state-of-the-art satellite measurements, while retaining single-digit relative error ($<10\%$) against collocated radar measurements. Code and data are available on our project page https://cloud4d.jacob-lin.com/.
comment: NeurIPS 2025 Spotlight, project page: https://cloud4d.jacob-lin.com/
☆ Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution AAAI 2026
Task scheduling is critical for embodied AI, enabling agents to follow natural language instructions and execute actions efficiently in 3D physical worlds. However, existing datasets often simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding. In this work, we propose Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. Unlike prior settings, ORS3D demands that agents minimize total completion time by leveraging parallelizable subtasks, e.g., cleaning the sink while the microwave operates. To facilitate research on ORS3D, we construct ORS3D-60K, a large-scale dataset comprising 60K composite tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency. The code is available at https://github.com/H-EmbodVis/GRANT
comment: Accepted to AAAI 2026 (Oral). The code is available at \url{https://github.com/H-EmbodVis/GRANT}
☆ Flow Map Distillation Without Data
State-of-the-art flow models achieve remarkable quality but require slow, iterative sampling. To accelerate this, flow maps can be distilled from pre-trained teachers, a procedure that conventionally requires sampling from an external dataset. We argue that this data-dependency introduces a fundamental risk of Teacher-Data Mismatch, as a static dataset may provide an incomplete or even misaligned representation of the teacher's full generative capabilities. This leads us to question whether this reliance on data is truly necessary for successful flow map distillation. In this work, we explore a data-free alternative that samples only from the prior distribution, a distribution the teacher is guaranteed to follow by construction, thereby circumventing the mismatch risk entirely. To demonstrate the practical viability of this philosophy, we introduce a principled framework that learns to predict the teacher's sampling path while actively correcting for its own compounding errors to ensure high fidelity. Our approach surpasses all data-based counterparts and establishes a new state-of-the-art by a significant margin. Specifically, distilling from SiT-XL/2+REPA, our method reaches an impressive FID of 1.45 on ImageNet 256x256, and 1.49 on ImageNet 512x512, both with only 1 sampling step. We hope our work establishes a more robust paradigm for accelerating generative models and motivates the broader adoption of flow map distillation without data.
☆ Ref-SAM3D: Bridging SAM3D with Text for Reference 3D Reconstruction
SAM3D has garnered widespread attention for its strong 3D object reconstruction capabilities. However, a key limitation remains: SAM3D cannot reconstruct specific objects referred to by textual descriptions, a capability that is essential for practical applications such as 3D editing, game development, and virtual environments. To address this gap, we introduce Ref-SAM3D, a simple yet effective extension to SAM3D that incorporates textual descriptions as a high-level prior, enabling text-guided 3D reconstruction from a single RGB image. Through extensive qualitative experiments, we show that Ref-SAM3D, guided only by natural language and a single 2D view, delivers competitive and high-fidelity zero-shot reconstruction performance. Our results demonstrate that Ref-SAM3D effectively bridges the gap between 2D visual cues and 3D geometric understanding, offering a more flexible and accessible paradigm for reference-guided 3D reconstruction. Code is available at: https://github.com/FudanCVL/Ref-SAM3D.
comment: Code: https://github.com/FudanCVL/Ref-SAM3D
☆ SAM3-Adapter: Efficient Adaptation of Segment Anything 3 for Camouflage Object Segmentation, Shadow Detection, and Medical Image Segmentation
The rapid rise of large-scale foundation models has reshaped the landscape of image segmentation, with models such as Segment Anything achieving unprecedented versatility across diverse vision tasks. However, previous generations-including SAM and its successor-still struggle with fine-grained, low-level segmentation challenges such as camouflaged object detection, medical image segmentation, cell image segmentation, and shadow detection. To address these limitations, we originally proposed SAM-Adapter in 2023, demonstrating substantial gains on these difficult scenarios. With the emergence of Segment Anything 3 (SAM3)-a more efficient and higher-performing evolution with a redesigned architecture and improved training pipeline-we revisit these long-standing challenges. In this work, we present SAM3-Adapter, the first adapter framework tailored for SAM3 that unlocks its full segmentation capability. SAM3-Adapter not only reduces computational overhead but also consistently surpasses both SAM and SAM2-based solutions, establishing new state-of-the-art results across multiple downstream tasks, including medical imaging, camouflaged (concealed) object segmentation, and shadow detection. Built upon the modular and composable design philosophy of the original SAM-Adapter, SAM3-Adapter provides stronger generalizability, richer task adaptability, and significantly improved segmentation precision. Extensive experiments confirm that integrating SAM3 with our adapter yields superior accuracy, robustness, and efficiency compared to all prior SAM-based adaptations. We hope SAM3-Adapter can serve as a foundation for future research and practical segmentation applications. Code, pre-trained models, and data processing pipelines are available.
☆ Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens-compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with COVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16% and demonstrates that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.
comment: Project page: https://wakalsprojectpage.github.io/comt-website/
☆ UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame
☆ In-Video Instructions: Visual Signals as Generative Control
Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.
☆ Real-Time Object Tracking with On-Device Deep Learning for Adaptive Beamforming in Dynamic Acoustic Environments
Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localization and directional audio capture in dynamic environments. The approach combines single-camera depth estimation and stereo vision to enable accurate 3D localization of moving objects. A planar concentric circular microphone array constructed with MEMS microphones provides a compact, energy-efficient platform supporting 2D beam steering across azimuth and elevation. Real-time tracking outputs continuously adapt the array's focus, synchronizing the acoustic response with the target's position. By uniting learned spatial awareness with dynamic steering, the system maintains robust performance in the presence of multiple or moving sources. Experimental evaluation demonstrates significant gains in signal-to-interference ratio, making the design well-suited for teleconferencing, smart home devices, and assistive technologies.
☆ BackSplit: The Importance of Sub-dividing the Background in Biomedical Lesion Segmentation
Segmenting small lesions in medical images remains notoriously difficult. Most prior work tackles this challenge by either designing better architectures, loss functions, or data augmentation schemes; and collecting more labeled data. We take a different view, arguing that part of the problem lies in how the background is modeled. Common lesion segmentation collapses all non-lesion pixels into a single "background" class, ignoring the rich anatomical context in which lesions appear. In reality, the background is highly heterogeneous-composed of tissues, organs, and other structures that can now be labeled manually or inferred automatically using existing segmentation models. In this paper, we argue that training with fine-grained labels that sub-divide the background class, which we call BackSplit, is a simple yet powerful paradigm that can offer a significant performance boost without increasing inference costs. From an information theoretic standpoint, we prove that BackSplit increases the expected Fisher Information relative to conventional binary training, leading to tighter asymptotic bounds and more stable optimization. With extensive experiments across multiple datasets and architectures, we empirically show that BackSplit consistently boosts small-lesion segmentation performance, even when auxiliary labels are generated automatically using pretrained segmentation models. Additionally, we demonstrate that auxiliary labels derived from interactive segmentation frameworks exhibit the same beneficial effect, demonstrating its robustness, simplicity, and broad applicability.
☆ UISearch: Graph-Based Embeddings for Multimodal Enterprise UI Screenshots Retrieval
Enterprise software companies maintain thousands of user interface screens across products and versions, creating critical challenges for design consistency, pattern discovery, and compliance check. Existing approaches rely on visual similarity or text semantics, lacking explicit modeling of structural properties fundamental to user interface (UI) composition. We present a novel graph-based representation that converts UI screenshots into attributed graphs encoding hierarchical relationships and spatial arrangements, potentially generalizable to document layouts, architectural diagrams, and other structured visual domains. A contrastive graph autoencoder learns embeddings preserving multi-level similarity across visual, structural, and semantic properties. The comprehensive analysis demonstrates that our structural embeddings achieve better discriminative power than state-of-the-art Vision Encoders, representing a fundamental advance in the expressiveness of the UI representation. We implement this representation in UISearch, a multi-modal search framework that combines structural embeddings with semantic search through a composable query language. On 20,396 financial software UIs, UISearch achieves 0.92 Top-5 accuracy with 47.5ms median latency (P95: 124ms), scaling to 20,000+ screens. The hybrid indexing architecture enables complex queries and supports fine-grained UI distinction impossible with vision-only approaches.
comment: 12 pages, 2 figures, 3 algorithms, 4 tables
☆ An Anatomy Aware Hybrid Deep Learning Framework for Lung Cancer Tumor Stage Classification
Accurate lung cancer tumor staging is crucial for prognosis and treatment planning. However, it remains challenging for end-to-end deep learning approaches, as such approaches often overlook spatial and anatomical information that are central to the tumor-node-metastasis system. The tumor stage depends on multiple quantitative criteria, including the tumor size and its proximity to the nearest anatomical structures, and small variations can alter the staging outcome. We propose a medically grounded hybrid pipeline that performs staging by explicitly measuring the tumor's size and distance properties rather than treating it as a pure image classification task. Our method employs specialized encoder-decoder networks to precisely segment the lung and adjacent anatomy, including the lobes, tumor, mediastinum, and diaphragm. Subsequently, we extract the necessary tumor properties, i.e. measure the largest tumor dimension and calculate the distance between the tumor and neighboring anatomical structures by a quantitative analysis of the segmentation masks. Finally, we apply rule-based tumor staging aligned with the medical guidelines. This novel framework has been evaluated on the Lung-PET-CT-Dx dataset, demonstrating superior performance compared to traditional deep learning models, achieving an overall classification accuracy of 91.36%. We report the per-stage F1-scores of 0.93 (T1), 0.89 (T2), 0.96 (T3), and 0.90 (T4), a critical evaluation aspect often omitted in prior literature. To our knowledge, this is the first study that embeds explicit clinical context into tumor stage classification. Unlike standard convolutional neural networks that operate in an uninterpretable "black box" manner, our method offers both state-of-the-art performance and transparent decision support.
☆ DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAE in the two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This thus frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Codes are publicly available at https://github.com/Zehong-Ma/DeCo.
comment: Project Page: https://zehong-ma.github.io/DeCo. Code Repository: https://github.com/Zehong-Ma/DeCo
☆ Growing with the Generator: Self-paced GRPO for Video Generation
Group Relative Policy Optimization (GRPO) has emerged as a powerful reinforcement learning paradigm for post-training video generation models. However, existing GRPO pipelines rely on static, fixed-capacity reward models whose evaluation behavior is frozen during training. Such rigid rewards introduce distributional bias, saturate quickly as the generator improves, and ultimately limit the stability and effectiveness of reinforcement-based alignment. We propose Self-Paced GRPO, a competence-aware GRPO framework in which reward feedback co-evolves with the generator. Our method introduces a progressive reward mechanism that automatically shifts its emphasis from coarse visual fidelity to temporal coherence and fine-grained text-video semantic alignment as generation quality increases. This self-paced curriculum alleviates reward-policy mismatch, mitigates reward exploitation, and yields more stable optimization. Experiments on VBench across multiple video generation backbones demonstrate consistent improvements in both visual quality and semantic alignment over GRPO baselines with static rewards, validating the effectiveness and generality of Self-Paced GRPO.
☆ CellFMCount: A Fluorescence Microscopy Dataset, Benchmark, and Methods for Cell Counting ICDM
Accurate cell counting is essential in various biomedical research and clinical applications, including cancer diagnosis, stem cell research, and immunology. Manual counting is labor-intensive and error-prone, motivating automation through deep learning techniques. However, training reliable deep learning models requires large amounts of high-quality annotated data, which is difficult and time-consuming to produce manually. Consequently, existing cell-counting datasets are often limited, frequently containing fewer than $500$ images. In this work, we introduce a large-scale annotated dataset comprising $3{,}023$ images from immunocytochemistry experiments related to cellular differentiation, containing over $430{,}000$ manually annotated cell locations. The dataset presents significant challenges: high cell density, overlapping and morphologically diverse cells, a long-tailed distribution of cell count per image, and variation in staining protocols. We benchmark three categories of existing methods: regression-based, crowd-counting, and cell-counting techniques on a test set with cell counts ranging from $10$ to $2{,}126$ cells per image. We also evaluate how the Segment Anything Model (SAM) can be adapted for microscopy cell counting using only dot-annotated datasets. As a case study, we implement a density-map-based adaptation of SAM (SAM-Counter) and report a mean absolute error (MAE) of $22.12$, which outperforms existing approaches (second-best MAE of $27.46$). Our results underscore the value of the dataset and the benchmarking framework for driving progress in automated cell counting and provide a robust foundation for future research and development.
comment: The IEEE International Conference on Data Mining (ICDM) 2025
☆ Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning
RL (reinforcement learning) methods (e.g., GRPO) for MLLM (Multimodal LLM) perception ability has attracted wide research interest owing to its remarkable generalization ability. Nevertheless, existing reinforcement learning methods still face the problem of low data quality, where data samples cannot elicit diverse responses from MLLMs, thus restricting the exploration scope for MLLM reinforcement learning. Some methods attempt to mitigate this problem by imposing constraints on entropy, but none address it at its root. Therefore, to tackle this problem, this work proposes Syn-GRPO (Synthesis-GRPO), which employs an online data generator to synthesize high-quality training data with diverse responses in GRPO training. Specifically, Syn-GRPO consists of two components: (1) data server; (2) GRPO workflow. The data server synthesizes new samples from existing ones using an image generation model, featuring a decoupled and asynchronous scheme to achieve high generation efficiency. The GRPO workflow provides the data server with the new image descriptions, and it leverages a diversity reward to supervise the MLLM to predict image descriptions for synthesizing samples with diverse responses. Experiment results across three visual perception tasks demonstrate that Syn-GRPO improves the data quality by a large margin, achieving significant superior performance to existing MLLM perception methods, and Syn-GRPO presents promising potential for scaling long-term self-evolving RL. Our code is available at https://github.com/hqhQAQ/Syn-GRPO.
☆ POUR: A Provably Optimal Method for Unlearning Representations via Neural Collapse
In computer vision, machine unlearning aims to remove the influence of specific visual concepts or training images without retraining from scratch. Studies show that existing approaches often modify the classifier while leaving internal representations intact, resulting in incomplete forgetting. In this work, we extend the notion of unlearning to the representation level, deriving a three-term interplay between forgetting efficacy, retention fidelity, and class separation. Building on Neural Collapse theory, we show that the orthogonal projection of a simplex Equiangular Tight Frame (ETF) remains an ETF in a lower dimensional space, yielding a provably optimal forgetting operator. We further introduce the Representation Unlearning Score (RUS) to quantify representation-level forgetting and retention fidelity. Building on this, we introduce POUR (Provably Optimal Unlearning of Representations), a geometric projection method with closed-form (POUR-P) and a feature-level unlearning variant under a distillation scheme (POUR-D). Experiments on CIFAR-10/100 and PathMNIST demonstrate that POUR achieves effective unlearning while preserving retained knowledge, outperforming state-of-the-art unlearning methods on both classification-level and representation-level metrics.
☆ MonoMSK: Monocular 3D Musculoskeletal Dynamics Estimation
Reconstructing biomechanically realistic 3D human motion - recovering both kinematics (motion) and kinetics (forces) - is a critical challenge. While marker-based systems are lab-bound and slow, popular monocular methods use oversimplified, anatomically inaccurate models (e.g., SMPL) and ignore physics, fundamentally limiting their biomechanical fidelity. In this work, we introduce MonoMSK, a hybrid framework that bridges data-driven learning and physics-based simulation for biomechanically realistic 3D human motion estimation from monocular video. MonoMSK jointly recovers both kinematics (motions) and kinetics (forces and torques) through an anatomically accurate musculoskeletal model. By integrating transformer-based inverse dynamics with differentiable forward kinematics and dynamics layers governed by ODE-based simulation, MonoMSK establishes a physics-regulated inverse-forward loop that enforces biomechanical causality and physical plausibility. A novel forward-inverse consistency loss further aligns motion reconstruction with the underlying kinetic reasoning. Experiments on BML-MoVi, BEDLAM, and OpenCap show that MonoMSK significantly outperforms state-of-the-art methods in kinematic accuracy, while for the first time enabling precise monocular kinetics estimation.
☆ SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation
Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading to failures such as identity drift and visual artifacts. We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation and is the first to ensure first-frame preservation robustly. Firstly, we propose a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions, enabling precise control without sacrificing fidelity. Secondly, we design Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, we employ a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control, while requiring significantly fewer training resources than comparable methods.
comment: 10 pages, with supp
☆ SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis
Hand-Object Interaction (HOI) generation plays a critical role in advancing applications across animation and robotics. Current video-based methods are predominantly single-view, which impedes comprehensive 3D geometry perception and often results in geometric distortions or unrealistic motion patterns. While 3D HOI approaches can generate dynamically plausible motions, their dependence on high-quality 3D data captured in controlled laboratory settings severely limits their generalization to real-world scenarios. To overcome these limitations, we introduce SyncMV4D, the first model that jointly generates synchronized multi-view HOI videos and 4D motions by unifying visual prior, motion dynamics, and multi-view geometry. Our framework features two core innovations: (1) a Multi-view Joint Diffusion (MJD) model that co-generates HOI videos and intermediate motions, and (2) a Diffusion Points Aligner (DPA) that refines the coarse intermediate motion into globally aligned 4D metric point tracks. To tightly couple 2D appearance with 4D dynamics, we establish a closed-loop, mutually enhancing cycle. During the diffusion denoising process, the generated video conditions the refinement of the 4D motion, while the aligned 4D point tracks are reprojected to guide next-step joint generation. Experimentally, our method demonstrates superior performance to state-of-the-art alternatives in visual realism, motion plausibility, and multi-view consistency.
comment: Project Page: https://droliven.github.io/SyncMV4D
☆ Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach
Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic styles, but also introduce copyright and security risks. Dataset watermarking has been proposed to ensure traceability by embedding imperceptible watermarks into training images, which remain detectable in outputs even after fine-tuning. However, current methods lack a unified evaluation framework. To address this, this paper establishes a general threat model and introduces a comprehensive evaluation framework encompassing Universality, Transmissibility, and Robustness. Experiments show that existing methods perform well in universality and transmissibility, and exhibit some robustness against common image processing operations, yet still fall short under real-world threat scenarios. To reveal these vulnerabilities, the paper further proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.
☆ Dual-Granularity Semantic Prompting for Language Guidance Infrared Small Target Detection
Infrared small target detection remains challenging due to limited feature representation and severe background interference, resulting in sub-optimal performance. While recent CLIP-inspired methods attempt to leverage textual guidance for detection, they are hindered by inaccurate text descriptions and reliance on manual annotations. To overcome these limitations, we propose DGSPNet, an end-to-end language prompt-driven framework. Our approach integrates dual-granularity semantic prompts: coarse-grained textual priors (e.g., 'infrared image', 'small target') and fine-grained personalized semantic descriptions derived through visual-to-textual mapping within the image space. This design not only facilitates learning fine-grained semantic information but also can inherently leverage language prompts during inference without relying on any annotation requirements. By fully leveraging the precision and conciseness of text descriptions, we further introduce a text-guide channel attention (TGCA) mechanism and text-guide spatial attention (TGSA) mechanism that enhances the model's sensitivity to potential targets across both low- and high-level feature spaces. Extensive experiments demonstrate that our method significantly improves detection accuracy and achieves state-of-the-art performance on three benchmark datasets.
comment: 10 pages, 2 figures
☆ IDEAL-M3D: Instance Diversity-Enriched Active Learning for Monocular 3D Detection
Monocular 3D detection relies on just a single camera and is therefore easy to deploy. Yet, achieving reliable 3D understanding from monocular images requires substantial annotation, and 3D labels are especially costly. To maximize performance under constrained labeling budgets, it is essential to prioritize annotating samples expected to deliver the largest performance gains. This prioritization is the focus of active learning. Curiously, we observed two significant limitations in active learning algorithms for 3D monocular object detection. First, previous approaches select entire images, which is inefficient, as non-informative instances contained in the same image also need to be labeled. Secondly, existing methods rely on uncertainty-based selection, which in monocular 3D object detection creates a bias toward depth ambiguity. Consequently, distant objects are selected, while nearby objects are overlooked. To address these limitations, we propose IDEAL-M3D, the first instance-level pipeline for monocular 3D detection. For the first time, we demonstrate that an explicitly diverse, fast-to-train ensemble improves diversity-driven active learning for monocular 3D. We induce diversity with heterogeneous backbones and task-agnostic features, loss weight perturbation, and time-dependent bagging. IDEAL-M3D shows superior performance and significant resource savings: with just 60% of the annotations, we achieve similar or better AP3D on KITTI validation and test set results compared to training the same detector on the whole dataset.
☆ DensifyBeforehand: LiDAR-assisted Content-aware Densification for Efficient and Quality 3D Gaussian Splatting
This paper addresses the limitations of existing 3D Gaussian Splatting (3DGS) methods, particularly their reliance on adaptive density control, which can lead to floating artifacts and inefficient resource usage. We propose a novel densify beforehand approach that enhances the initialization of 3D scenes by combining sparse LiDAR data with monocular depth estimation from corresponding RGB images. Our ROI-aware sampling scheme prioritizes semantically and geometrically important regions, yielding a dense point cloud that improves visual fidelity and computational efficiency. This densify beforehand approach bypasses the adaptive density control that may introduce redundant Gaussians in the original pipeline, allowing the optimization to focus on the other attributes of 3D Gaussian primitives, reducing overlap while enhancing visual quality. Our method achieves comparable results to state-of-the-art techniques while significantly lowering resource consumption and training time. We validate our approach through extensive comparisons and ablation studies on four newly collected datasets, showcasing its effectiveness in preserving regions of interest in complex scenes.
☆ ReMatch: Boosting Representation through Matching for Multimodal Retrieval
We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature, and under-utilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained contextual, mutually orthogonal embeddings with low inference cost. Leveraging our established high-performance baseline,we assemble the ideas mentioned above into a powerful training recipe and achieve a new state-of-the-art on the Massive Multimodal Embedding Benchmark (MMEB). Our experiments show particularly strong zero-shot generalization results on five datasets, highlighting the robustness and transferability of ReMatch.
☆ Diffusion Reconstruction-based Data Likelihood Estimation for Core-Set Selection AAAI 2026
Existing core-set selection methods predominantly rely on heuristic scoring signals such as training dynamics or model uncertainty, lacking explicit modeling of data likelihood. This omission may hinder the constructed subset from capturing subtle yet critical distributional structures that underpin effective model training. In this work, we propose a novel, theoretically grounded approach that leverages diffusion models to estimate data likelihood via reconstruction deviation induced by partial reverse denoising. Specifically, we establish a formal connection between reconstruction error and data likelihood, grounded in the Evidence Lower Bound (ELBO) of Markovian diffusion processes, thereby enabling a principled, distribution-aware scoring criterion for data selection. Complementarily, we introduce an efficient information-theoretic method to identify the optimal reconstruction timestep, ensuring that the deviation provides a reliable signal indicative of underlying data likelihood. Extensive experiments on ImageNet demonstrate that reconstruction deviation offers an effective scoring criterion, consistently outperforming existing baselines across selection ratios, and closely matching full-data training using only 50% of the data. Further analysis shows that the likelihood-informed nature of our score reveals informative insights in data selection, shedding light on the interplay between data distributional characteristics and model learning preferences.
comment: Accepted by AAAI 2026
☆ BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment
Conditional image generation enhances text-to-image synthesis with structural, spatial, or stylistic priors, but current methods face challenges in handling conflicts between sources. These include 1) input-level conflicts, where the conditioning image contradicts the text prompt, and 2) model-bias conflicts, where generative biases disrupt alignment even when conditions match the text. Addressing these conflicts requires nuanced solutions, which standard supervised fine-tuning struggles to provide. Preference-based optimization techniques like Direct Preference Optimization (DPO) show promise but are limited by gradient entanglement between text and condition signals and lack disentangled training data for multi-constraint tasks. To overcome this, we propose a bidirectionally decoupled DPO framework (BideDPO). Our method creates two disentangled preference pairs-one for the condition and one for the text-to reduce gradient entanglement. The influence of pairs is managed using an Adaptive Loss Balancing strategy for balanced optimization. We introduce an automated data pipeline to sample model outputs and generate conflict-aware data. This process is embedded in an iterative optimization strategy that refines both the model and the data. We construct a DualAlign benchmark to evaluate conflict resolution between text and condition. Experiments show BideDPO significantly improves text success rates (e.g., +35%) and condition adherence. We also validate our approach using the COCO dataset. Project Pages: https://limuloo.github.io/BideDPO/.
comment: 29 pages
☆ LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models
Humans can perceive and understand 3D space and long videos from sequential visual observations. But do vision-language models (VLMs) can? Recent work demonstrates that even state-of-the-art VLMs still struggle to understand 3D space and long videos, although they are powerful in typical vision-language tasks. Current methods often rely on specialized architectural designs to improve performance for 3D tasks and video understanding tasks separately. In contrast, we propose LAST, short for LeArn to Think in Space and Time, to jointly improve 3D spatial and long video understanding for general VLMs with only a set of 2D images as inputs. LAST makes VLMs think in space and time rather than only with text before giving the final answer, building visual thinking trajectories in 3D space and temporal dimension. We demonstrate the effectiveness of LAST in two scenarios: 1) zero-shot, where we directly prompt proprietary models; and 2) fine-tuning general VLMs with data that include thinking trajectories in 3D space and time. We show that LAST brings substantial gains in various benchmarks, including 3 spatial understanding, 4 video understanding, and 3 image understanding tasks. Notably, 15.8% gains on EgoSchema with GPT-4o in a zero-shot manner and 8.3 gains on VSI-Bench compared with Qwen2.5-VL-7B.
☆ Adversarial Patch Attacks on Vision-Based Cargo Occupancy Estimation via Differentiable 3D Simulation
Computer vision systems are increasingly adopted in modern logistics operations, including the estimation of trailer occupancy for planning, routing, and billing. Although effective, such systems may be vulnerable to physical adversarial attacks, particularly adversarial patches that can be printed and placed on interior surfaces. In this work, we study the feasibility of such attacks on a convolutional cargo-occupancy classifier using fully simulated 3D environments. Using Mitsuba 3 for differentiable rendering, we optimize patch textures across variations in geometry, lighting, and viewpoint, and compare their effectiveness to a 2D compositing baseline. Our experiments demonstrate that 3D-optimized patches achieve high attack success rates, especially in a denial-of-service scenario (empty to full), where success reaches 84.94 percent. Concealment attacks (full to empty) prove more challenging but still reach 30.32 percent. We analyze the factors influencing attack success, discuss implications for the security of automated logistics pipelines, and highlight directions for strengthening physical robustness. To our knowledge, this is the first study to investigate adversarial patch attacks for cargo-occupancy estimation in physically realistic, fully simulated 3D scenes.
comment: 9 pages, 5 figures, 1 algorithm
☆ FedPoisonTTP: A Threat Model and Poisoning Attack for Federated Test-Time Personalization
Test-time personalization in federated learning enables models at clients to adjust online to local domain shifts, enhancing robustness and personalization in deployment. Yet, existing federated learning work largely overlooks the security risks that arise when local adaptation occurs at test time. Heterogeneous domain arrivals, diverse adaptation algorithms, and limited cross-client visibility create vulnerabilities where compromised participants can craft poisoned inputs and submit adversarial updates that undermine both global and per-client performance. To address this threat, we introduce FedPoisonTTP, a realistic grey-box attack framework that explores test-time data poisoning in the federated adaptation setting. FedPoisonTTP distills a surrogate model from adversarial queries, synthesizes in-distribution poisons using feature-consistency, and optimizes attack objectives to generate high-entropy or class-confident poisons that evade common adaptation filters. These poisons are injected during local adaptation and spread through collaborative updates, leading to broad degradation. Extensive experiments on corrupted vision benchmarks show that compromised participants can substantially diminish overall test-time performance.
comment: 13 pages, 3 figures, 2 tables
☆ IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes
Reconstructing dynamic driving scenes is essential for developing autonomous systems through sensor-realistic simulation. Although recent methods achieve high-fidelity reconstructions, they either rely on costly human annotations for object trajectories or use time-varying representations without explicit object-level decomposition, leading to intertwined static and dynamic elements that hinder scene separation. We present IDSplat, a self-supervised 3D Gaussian Splatting framework that reconstructs dynamic scenes with explicit instance decomposition and learnable motion trajectories, without requiring human annotations. Our key insight is to model dynamic objects as coherent instances undergoing rigid transformations, rather than unstructured time-varying primitives. For instance decomposition, we employ zero-shot, language-grounded video tracking anchored to 3D using lidar, and estimate consistent poses via feature correspondences. We introduce a coordinated-turn smoothing scheme to obtain temporally and physically consistent motion trajectories, mitigating pose misalignments and tracking failures, followed by joint optimization of object poses and Gaussian parameters. Experiments on the Waymo Open Dataset demonstrate that our method achieves competitive reconstruction quality while maintaining instance-level decomposition and generalizes across diverse sequences and view densities without retraining, making it practical for large-scale autonomous driving applications. Code will be released.
☆ Learning Plug-and-play Memory for Guiding Video Diffusion Models
Diffusion Transformer(DiT) based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore how to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies to show that DiT can be steered via interventions on its hidden states, and simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance and high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose a learnable memory encoder DiT-Mem, composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a compact set of memory tokens, which are concatenated as the memory within the DiT self-attention layers. During training, we keep the diffusion backbone frozen, and only optimize the memory encoder. It yields a rather efficient training process on few training parameters (150M) and 10K data samples, and enables plug-and-play usage at inference time. Extensive experiments on state-of-the-art models demonstrate the effectiveness of our method in improving physical rule following and video fidelity. Our code and data are publicly released here: https://thrcle421.github.io/DiT-Mem-Web/.
☆ Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving
Autonomous driving heavily relies on accurate and robust spatial perception. Many failures arise from inaccuracies and instability, especially in long-tail scenarios and complex interactions. However, current vision-language models are weak at spatial grounding and understanding, and VLA systems built on them therefore show limited perception and localization ability. To address these challenges, we introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that is the first to implicitly integrate 2D/3D scene understanding abilities within a single vision-language model (VLM). Instead of relying on QA-style spatial reasoning, Percept-WAM unifies 2D/3D perception tasks into World-PV and World-BEV tokens, which encode both spatial coordinates and confidence. We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Additionally, Percept-WAM leverages pretrained VLM parameters to retain general intelligence (e.g., logical reasoning) and can output perception results and trajectory control outputs directly. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on COCO 2D detection and nuScenes BEV 3D detection. When integrated with trajectory decoders, it further improves planning performance on nuScenes and NAVSIM, e.g., surpassing DiffusionDrive by 2.1 in PMDS on NAVSIM. Qualitative results further highlight its strong open-vocabulary and long-tail generalization.
☆ Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering
Large vision language models (VLMs) have achieved impressive performance on medical visual question answering benchmarks, yet their reliance on visual information remains unclear. We investigate whether frontier VLMs demonstrate genuine visual grounding when answering Italian medical questions by testing four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 flash exp. Using 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation, we substitute correct medical images with blank placeholders to test whether models truly integrate visual and textual information. Our results reveal striking variability in visual dependency: GPT-4o shows the strongest visual grounding with a 27.9pp accuracy drop (83.2% [74.6%, 91.7%] to 55.3% [44.1%, 66.6%]), while GPT-5-mini, Gemini, and Claude maintain high accuracy with modest drops of 8.5pp, 2.4pp, and 5.6pp respectively. Analysis of model-generated reasoning reveals confident explanations for fabricated visual interpretations across all models, suggesting varying degrees of reliance on textual shortcuts versus genuine visual analysis. These findings highlight critical differences in model robustness and the need for rigorous evaluation before clinical deployment.
comment: Accepted at the Workshop on Multimodal Representation Learning for Healthcare (MMRL4H), EurIPS 2025
☆ ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment AAAI 2026
Text-to-motion generation, which synthesizes 3D human motions from text inputs, holds immense potential for applications in gaming, film, and robotics. Recently, diffusion-based methods have been shown to generate more diversity and realistic motion. However, there exists a misalignment between text and motion distributions in diffusion models, which leads to semantically inconsistent or low-quality motions. To address this limitation, we propose Reward-guided sampling Alignment (ReAlign), comprising a step-aware reward model to assess alignment quality during the denoising sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Extensive experiments of both motion generation and retrieval tasks demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
comment: Accepted by AAAI 2026
☆ NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting
3D Gaussian Splatting can exploit frustum culling and level-of-detail strategies to accelerate rendering of scenes containing a large number of primitives. However, the semi-transparent nature of Gaussians prevents the application of another highly effective technique: occlusion culling. We address this limitation by proposing a novel method to learn the viewpoint-dependent visibility function of all Gaussians in a trained model using a small, shared MLP across instances of an asset in a scene. By querying it for Gaussians within the viewing frustum prior to rasterization, our method can discard occluded primitives during rendering. Leveraging Tensor Cores for efficient computation, we integrate these neural queries directly into a novel instanced software rasterizer. Our approach outperforms the current state of the art for composed scenes in terms of VRAM usage and image quality, utilizing a combination of our instanced rasterizer and occlusion culling MLP, and exhibits complementary properties to existing LoD techniques.
comment: 15 pages, 13 figures
☆ Can Modern Vision Models Understand the Difference Between an Object and a Look-alike?
Recent advances in computer vision have yielded models with strong performance on recognition benchmarks; however, significant gaps remain in comparison to human perception. One subtle ability is to judge whether an image looks like a given object without being an instance of that object. We study whether vision-language models such as CLIP capture this distinction. We curated a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars (e.g., toys, statues, drawings, pareidolia) across multiple categories, and first evaluate a prompt-based baseline with paired "real"/"lookalike" prompts. We then estimate a direction in CLIP's embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval on Conceptual12M, and also enhances captions produced by a CLIP prefix captioner.
☆ CLASH: A Benchmark for Cross-Modal Contradiction Detection
Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection - a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection, featuring COCO images paired with contradictory captions containing controlled object-level or attribute-level contradictions. The samples include targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.
comment: First two authors contributed equally
☆ Three-Dimensional Anatomical Data Generation Based on Artificial Neural Networks IROS
Surgical planning and training based on machine learning requires a large amount of 3D anatomical models reconstructed from medical imaging, which is currently one of the major bottlenecks. Obtaining these data from real patients and during surgery is very demanding, if even possible, due to legal, ethical, and technical challenges. It is especially difficult for soft tissue organs with poor imaging contrast, such as the prostate. To overcome these challenges, we present a novel workflow for automated 3D anatomical data generation using data obtained from physical organ models. We additionally use a 3D Generative Adversarial Network (GAN) to obtain a manifold of 3D models useful for other downstream machine learning tasks that rely on 3D data. We demonstrate our workflow using an artificial prostate model made of biomimetic hydrogels with imaging contrast in multiple zones. This is used to physically simulate endoscopic surgery. For evaluation and 3D data generation, we place it into a customized ultrasound scanner that records the prostate before and after the procedure. A neural network is trained to segment the recorded ultrasound images, which outperforms conventional, non-learning-based computer vision techniques in terms of intersection over union (IoU). Based on the segmentations, a 3D mesh model is reconstructed, and performance feedback is provided.
comment: 6 pages, 4 figures, 1 table, IEEE International Conference on Intelligent Robots and Systems (IROS)
☆ SpectraNet: FFT-assisted Deep Learning Classifier for Deepfake Face Detection
Detecting deepfake images is crucial in combating misinformation. We present a lightweight, generalizable binary classification model based on EfficientNet-B6, fine-tuned with transformation techniques to address severe class imbalances. By leveraging robust preprocessing, oversampling, and optimization strategies, our model achieves high accuracy, stability, and generalization. While incorporating Fourier transform-based phase and amplitude features showed minimal impact, our proposed framework helps non-experts to effectively identify deepfake images, making significant strides toward accessible and reliable deepfake detection.
comment: 4 pages, 3 figures
☆ nnActive: A Framework for Evaluation of Active Learning in 3D Biomedical Segmentation
Semantic segmentation is crucial for various biomedical applications, yet its reliance on large annotated datasets presents a bottleneck due to the high cost and specialized expertise required for manual labeling. Active Learning (AL) aims to mitigate this challenge by querying only the most informative samples, thereby reducing annotation effort. However, in the domain of 3D biomedical imaging, there is no consensus on whether AL consistently outperforms Random sampling. Four evaluation pitfalls hinder the current methodological assessment. These are (1) restriction to too few datasets and annotation budgets, (2) using 2D models on 3D images without partial annotations, (3) Random baseline not being adapted to the task, and (4) measuring annotation cost only in voxels. In this work, we introduce nnActive, an open-source AL framework that overcomes these pitfalls by (1) means of a large scale study spanning four biomedical imaging datasets and three label regimes, (2) extending nnU-Net by using partial annotations for training with 3D patch-based query selection, (3) proposing Foreground Aware Random sampling strategies tackling the foreground-background class imbalance of medical images and (4) propose the foreground efficiency metric, which captures the low annotation cost of background-regions. We reveal the following findings: (A) while all AL methods outperform standard Random sampling, none reliably surpasses an improved Foreground Aware Random sampling; (B) benefits of AL depend on task specific parameters; (C) Predictive Entropy is overall the best performing AL method, but likely requires the most annotation effort; (D) AL performance can be improved with more compute intensive design choices. As a holistic, open-source framework, nnActive can serve as a catalyst for research and application of AL in 3D biomedical imaging. Code is at: https://github.com/MIC-DKFZ/nnActive
comment: Accepted at TMLR
☆ Evaluating Deep Learning and Traditional Approaches Used in Source Camera Identification
One of the most important tasks in computer vision is identifying the device using which the image was taken, useful for facilitating further comprehensive analysis of the image. This paper presents comparative analysis of three techniques used in source camera identification (SCI): Photo Response Non-Uniformity (PRNU), JPEG compression artifact analysis, and convolutional neural networks (CNNs). It evaluates each method in terms of device classification accuracy. Furthermore, the research discusses the possible scientific development needed for the implementation of the methods in real-life scenarios.
comment: 4 figures
☆ MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes
Recently, 3D Gaussian Splatting and its derivatives have achieved significant breakthroughs in large-scale scene reconstruction. However, how to efficiently and stably achieve high-quality geometric fidelity remains a core challenge. To address this issue, we introduce MetroGS, a novel Gaussian Splatting framework for efficient and robust reconstruction in complex urban environments. Our method is built upon a distributed 2D Gaussian Splatting representation as the core foundation, serving as a unified backbone for subsequent modules. To handle potential sparse regions in complex scenes, we propose a structured dense enhancement scheme that utilizes SfM priors and a pointmap model to achieve a denser initialization, while incorporating a sparsity compensation mechanism to improve reconstruction completeness. Furthermore, we design a progressive hybrid geometric optimization strategy that organically integrates monocular and multi-view optimization to achieve efficient and accurate geometric refinement. Finally, to address the appearance inconsistency commonly observed in large-scale scenes, we introduce a depth-guided appearance modeling approach that learns spatial features with 3D consistency, facilitating effective decoupling between geometry and appearance and further enhancing reconstruction stability. Experiments on large-scale urban datasets demonstrate that MetroGS achieves superior geometric accuracy, rendering quality, offering a unified solution for high-fidelity large-scale scene reconstruction.
comment: Project page: https://m3phist0.github.io/MetroGS
☆ Test-Time Preference Optimization for Image Restoration AAAI26
Image restoration (IR) models are typically trained to recover high-quality images using L1 or LPIPS loss. To handle diverse unknown degradations, zero-shot IR methods have also been introduced. However, existing pre-trained and zero-shot IR approaches often fail to align with human preferences, resulting in restored images that may not be favored. This highlights the critical need to enhance restoration quality and adapt flexibly to various image restoration tasks or backbones without requiring model retraining and ideally without labor-intensive preference data collection. In this paper, we propose the first Test-Time Preference Optimization (TTPO) paradigm for image restoration, which enhances perceptual quality, generates preference data on-the-fly, and is compatible with any IR model backbone. Specifically, we design a training-free, three-stage pipeline: (i) generate candidate preference images online using diffusion inversion and denoising based on the initially restored image; (ii) select preferred and dispreferred images using automated preference-aligned metrics or human feedback; and (iii) use the selected preference images as reward signals to guide the diffusion denoising process, optimizing the restored image to better align with human preferences. Extensive experiments across various image restoration tasks and models demonstrate the effectiveness and flexibility of the proposed pipeline.
comment: Accepted by AAAI26
☆ From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation
This paper introduces the retrieval-augmented framework for automatic fashion caption and hashtag generation, combining multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically interesting text for fashion imagery, overcoming the limitations of end-to-end captioners that have problems with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module for fabric and gender attribute inference based on a structured product index. These attributes, together with retrieved style examples, create a factual evidence pack that is used to guide an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model is used as a supervised baseline model for comparison. Experimental results show that the YOLO detector is able to obtain a mean Average Precision (mAP@0.5) of 0.71 for nine categories of garments. The RAG-LLM pipeline generates expressive attribute-aligned captions and achieves mean attribute coverage of 0.80 with full coverage at the 50% threshold in hashtag generation, whereas BLIP gives higher lexical overlap and lower generalization. The retrieval-augmented approach exhibits better factual grounding, less hallucination, and great potential for scalable deployment in various clothing domains. These results demonstrate the use of retrieval-augmented generation as an effective and interpretable paradigm for automated and visually grounded fashion content generation.
comment: Submitted to Expert Systems with Applications
☆ Collaborative Learning with Multiple Foundation Models for Source-Free Domain Adaptation
Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to an unlabeled target domain without access to source data. Recent advances in Foundation Models (FMs) have introduced new opportunities for leveraging external semantic knowledge to guide SFDA. However, relying on a single FM is often insufficient, as it tends to bias adaptation toward a restricted semantic coverage, failing to capture diverse contextual cues under domain shift. To overcome this limitation, we propose a Collaborative Multi-foundation Adaptation (CoMA) framework that jointly leverages two different FMs (e.g., CLIP and BLIP) with complementary properties to capture both global semantics and local contextual cues. Specifically, we employ a bidirectional adaptation mechanism that (1) aligns different FMs with the target model for task adaptation while maintaining their semantic distinctiveness, and (2) transfers complementary knowledge from the FMs to the target model. To ensure stable adaptation under mini-batch training, we introduce Decomposed Mutual Information (DMI) that selectively enhances true dependencies while suppressing false dependencies arising from incomplete class coverage. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art SFDA methods across four benchmarks, including Office-31, Office-Home, DomainNet-126, and VisDA, under the closed-set setting, while also achieving best results on partial-set and open-set variants.
comment: 15 pages, 8 figures
☆ ABM-LoRA: Activation Boundary Matching for Fast Convergence in Low-Rank Adaptation
We propose Activation Boundary Matching for Low-Rank Adaptation (ABM-LoRA), a principled initialization strategy that substantially accelerates the convergence of low-rank adapters. While LoRA offers high parameter efficiency, its random initialization restricts gradient updates to a mismatched tangent space, causing significant information loss and hindering early convergence. Our ABM-LoRA addresses this by aligning the adapter's activation boundaries with those of the pretrained model before downstream training, thereby maximizing the projection of full-parameter gradients into the adapter subspace. This alignment sharply reduces information loss at initialization, yields a lower starting loss, and accelerates convergence. We demonstrate ABM-LoRA's effectiveness across diverse architectures and tasks: language understanding (T5-Base on GLUE), dialogue generation (LLaMA2-7B on WizardLM), and vision recognition (ViT-B/16 on VTAB-1K). On VTAB-1K, it achieves the highest accuracy among all methods, with strong gains on structured reasoning tasks requiring geometric understanding.
comment: 16 pages, 5 figures, under review
☆ FilmSceneDesigner: Chaining Set Design for Procedural Film Scene Generation
Film set design plays a pivotal role in cinematic storytelling and shaping the visual atmosphere. However, the traditional process depends on expert-driven manual modeling, which is labor-intensive and time-consuming. To address this issue, we introduce FilmSceneDesigner, an automated scene generation system that emulates professional film set design workflow. Given a natural language description, including scene type, historical period, and style, we design an agent-based chaining framework to generate structured parameters aligned with film set design workflow, guided by prompt strategies that ensure parameter accuracy and coherence. On the other hand, we propose a procedural generation pipeline which executes a series of dedicated functions with the structured parameters for floorplan and structure generation, material assignment, door and window placement, and object retrieval and layout, ultimately constructing a complete film scene from scratch. Moreover, to enhance cinematic realism and asset diversity, we construct SetDepot-Pro, a curated dataset of 6,862 film-specific 3D assets and 733 materials. Experimental results and human evaluations demonstrate that our system produces structurally sound scenes with strong cinematic fidelity, supporting downstream tasks such as virtual previs, construction drawing and mood board creation.
☆ MambaRefine-YOLO: A Dual-Modality Small Object Detector for UAV Imagery
Small object detection in Unmanned Aerial Vehicle (UAV) imagery is a persistent challenge, hindered by low resolution and background clutter. While fusing RGB and infrared (IR) data offers a promising solution, existing methods often struggle with the trade-off between effective cross-modal interaction and computational efficiency. In this letter, we introduce MambaRefine-YOLO. Its core contributions are a Dual-Gated Complementary Mamba fusion module (DGC-MFM) that adaptively balances RGB and IR modalities through illumination-aware and difference-aware gating mechanisms, and a Hierarchical Feature Aggregation Neck (HFAN) that uses a ``refine-then-fuse'' strategy to enhance multi-scale features. Our comprehensive experiments validate this dual-pronged approach. On the dual-modality DroneVehicle dataset, the full model achieves a state-of-the-art mAP of 83.2%, an improvement of 7.9% over the baseline. On the single-modality VisDrone dataset, a variant using only the HFAN also shows significant gains, demonstrating its general applicability. Our work presents a superior balance between accuracy and speed, making it highly suitable for real-world UAV applications.
comment: Submitted to IEEE Geoscience and Remote Sensing Letters
☆ When Semantics Regulate: Rethinking Patch Shuffle and Internal Bias for Generated Image Detection with CLIP
The rapid progress of GANs and Diffusion Models poses new challenges for detecting AI-generated images. Although CLIP-based detectors exhibit promising generalization, they often rely on semantic cues rather than generator artifacts, leading to brittle performance under distribution shifts. In this work, we revisit the nature of semantic bias and uncover that Patch Shuffle provides an unusually strong benefit for CLIP, that disrupts global semantic continuity while preserving local artifact cues, which reduces semantic entropy and homogenizes feature distributions between natural and synthetic images. Through a detailed layer-wise analysis, we further show that CLIP's deep semantic structure functions as a regulator that stabilizes cross-domain representations once semantic bias is suppressed. Guided by these findings, we propose SemAnti, a semantic-antagonistic fine-tuning paradigm that freezes the semantic subspace and adapts only artifact-sensitive layers under shuffled semantics. Despite its simplicity, SemAnti achieves state-of-the-art cross-domain generalization on AIGCDetectBenchmark and GenImage, demonstrating that regulating semantics is key to unlocking CLIP's full potential for robust AI-generated image detection.
comment: 14 pages, 7 figures and 7 tables
☆ MonoSR: Open-Vocabulary Spatial Reasoning from Monocular Images
Spatial reasoning (SR), the ability to infer 3D spatial information from 2D inputs, is essential for real-world applications such as embodied AI and autonomous driving. However, existing research primarily focuses on indoor environments and typically relies on multi-view observations, which limits their generalizability to outdoor scenarios and constrains their applicability to monocular images, the most common real-world setting. In this work, we propose MonoSR, a large-scale monocular spatial reasoning dataset that spans diverse scenarios including indoor, outdoor, and object-centric settings, and supports multiple question types. MonoSR provides a path toward open-world monocular spatial reasoning. Beyond introducing the dataset, we evaluate advanced vision-language models to reveal their limitations on this challenging task. We further analyze whether auxiliary information is crucial for monocular spatial reasoning and offer practical guidance for designing future models. These contributions collectively establish a foundation for advancing monocular spatial reasoning in real-world, open-world environments.
☆ 3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion
The miniaturization of thermal sensors for mobile platforms inherently limits their spatial resolution and textural fidelity, leading to blurry and less informative images. Existing thermal super-resolution (SR) methods can be grouped into single-image and RGB-guided approaches: the former struggles to recover fine structures from limited information, while the latter relies on accurate and laborious cross-camera calibration, which hinders practical deployment and robustness. Here, we propose 3M-TI, a calibration-free Multi-camera cross-Modality diffusion framework for Mobile Thermal Imaging. At its core, 3M-TI integrates a cross-modal self-attention module (CSM) into the diffusion UNet, replacing the original self-attention layers to adaptively align thermal and RGB features throughout the denoising process, without requiring explicit camera calibration. This design enables the diffusion network to leverage its generative prior to enhance spatial resolution, structural fidelity, and texture detail in the super-resolved thermal images. Extensive evaluations on real-world mobile thermal cameras and public benchmarks validate our superior performance, achieving state-of-the-art results in both visual quality and quantitative metrics. More importantly, the thermal images enhanced by 3M-TI lead to substantial gains in critical downstream tasks like object detection and segmentation, underscoring its practical value for robust mobile thermal perception systems. More materials: https://github.com/work-submit/3MTI.
comment: 11 pages, 7 figures
☆ DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection
Diffusion-based editing enables realistic modification of local image regions, making AI-generated content harder to detect. Existing AIGC detection benchmarks focus on classifying entire images, overlooking the localization of diffusion-based edits. We introduce DiffSeg30k, a publicly available dataset of 30k diffusion-edited images with pixel-level annotations, designed to support fine-grained detection. DiffSeg30k features: 1) In-the-wild images--we collect images or image prompts from COCO to reflect real-world content diversity; 2) Diverse diffusion models--local edits using eight SOTA diffusion models; 3) Multi-turn editing--each image undergoes up to three sequential edits to mimic real-world sequential editing; and 4) Realistic editing scenarios--a vision-language model (VLM)-based pipeline automatically identifies meaningful regions and generates context-aware prompts covering additions, removals, and attribute changes. DiffSeg30k shifts AIGC detection from binary classification to semantic segmentation, enabling simultaneous localization of edits and identification of the editing models. We benchmark three baseline segmentation approaches, revealing significant challenges in semantic segmentation tasks, particularly concerning robustness to image distortions. Experiments also reveal that segmentation models, despite being trained for pixel-level localization, emerge as highly reliable whole-image classifiers of diffusion edits, outperforming established forgery classifiers while showing great potential in cross-generator generalization. We believe DiffSeg30k will advance research in fine-grained localization of AI-generated content by demonstrating the promise and limitations of segmentation-based methods. DiffSeg30k is released at: https://huggingface.co/datasets/Chaos2629/Diffseg30k
comment: 16 pages, 10 figures
☆ HABIT: Human Action Benchmark for Interactive Traffic in CARLA WACV 2026
Current autonomous driving (AD) simulations are critically limited by their inadequate representation of realistic and diverse human behavior, which is essential for ensuring safety and reliability. Existing benchmarks often simplify pedestrian interactions, failing to capture complex, dynamic intentions and varied responses critical for robust system deployment. To overcome this, we introduce HABIT (Human Action Benchmark for Interactive Traffic), a high-fidelity simulation benchmark. HABIT integrates real-world human motion, sourced from mocap and videos, into CARLA (Car Learning to Act, a full autonomous driving simulator) via a modular, extensible, and physically consistent motion retargeting pipeline. From an initial pool of approximately 30,000 retargeted motions, we curate 4,730 traffic-compatible pedestrian motions, standardized in SMPL format for physically consistent trajectories. HABIT seamlessly integrates with CARLA's Leaderboard, enabling automated scenario generation and rigorous agent evaluation. Our safety metrics, including Abbreviated Injury Scale (AIS) and False Positive Braking Rate (FPBR), reveal critical failure modes in state-of-the-art AD agents missed by prior evaluations. Evaluating three state-of-the-art autonomous driving agents, InterFuser, TransFuser, and BEVDriver, demonstrates how HABIT exposes planner weaknesses that remain hidden in scripted simulations. Despite achieving close or equal to zero collisions per kilometer on the CARLA Leaderboard, the autonomous agents perform notably worse on HABIT, with up to 7.43 collisions/km and a 12.94% AIS 3+ injury risk, and they brake unnecessarily in up to 33% of cases. All components are publicly released to support reproducible, pedestrian-aware AI research.
comment: Accepted to WACV 2026. This is the pre-camera-ready version
☆ Graph-based 3D Human Pose Estimation using WiFi Signals
WiFi-based human pose estimation (HPE) has attracted increasing attention due to its resilience to occlusion and privacy-preserving compared to camera-based methods. However, existing WiFi-based HPE approaches often employ regression networks that directly map WiFi channel state information (CSI) to 3D joint coordinates, ignoring the inherent topological relationships among human joints. In this paper, we present GraphPose-Fi, a graph-based framework that explicitly models skeletal topology for WiFi-based 3D HPE. Our framework comprises a CNN encoder shared across antennas for subcarrier-time feature extraction, a lightweight attention module that adaptively reweights features over time and across antennas, and a graph-based regression head that combines GCN layers with self-attention to capture local topology and global dependencies. Our proposed method significantly outperforms existing methods on the MM-Fi dataset in various settings. The source code is available at: https://github.com/Cirrick/GraphPose-Fi.
☆ Towards Generalizable Deepfake Detection via Forgery-aware Audio-Visual Adaptation: A Variational Bayesian Approach
The widespread application of AIGC contents has brought not only unprecedented opportunities, but also potential security concerns, e.g., audio-visual deepfakes. Therefore, it is of great importance to develop an effective and generalizable method for multi-modal deepfake detection. Typically, the audio-visual correlation learning could expose subtle cross-modal inconsistencies, e.g., audio-visual misalignment, which serve as crucial clues in deepfake detection. In this paper, we reformulate the correlation learning with variational Bayesian estimation, where audio-visual correlation is approximated as a Gaussian distributed latent variable, and thus develop a novel framework for deepfake detection, i.e., Forgery-aware Audio-Visual Adaptation with Variational Bayes (FoVB). Specifically, given the prior knowledge of pre-trained backbones, we adopt two core designs to estimate audio-visual correlations effectively. First, we exploit various difference convolutions and a high-pass filter to discern local and global forgery traces from both modalities. Second, with the extracted forgery-aware features, we estimate the latent Gaussian variable of audio-visual correlation via variational Bayes. Then, we factorize the variable into modality-specific and correlation-specific ones with orthogonality constraint, allowing them to better learn intra-modal and cross-modal forgery traces with less entanglement. Extensive experiments demonstrate that our FoVB outperforms other state-of-the-art methods in various benchmarks.
comment: TIFS AQE
☆ DEAP-3DSAM: Decoder Enhanced and Auto Prompt SAM for 3D Medical Image Segmentation
The Segment Anything Model (SAM) has recently demonstrated significant potential in medical image segmentation. Although SAM is primarily trained on 2D images, attempts have been made to apply it to 3D medical image segmentation. However, the pseudo 3D processing used to adapt SAM results in spatial feature loss, limiting its performance. Additionally, most SAM-based methods still rely on manual prompts, which are challenging to implement in real-world scenarios and require extensive external expert knowledge. To address these limitations, we introduce the Decoder Enhanced and Auto Prompt SAM (DEAP-3DSAM) to tackle these limitations. Specifically, we propose a Feature Enhanced Decoder that fuses the original image features with rich and detailed spatial information to enhance spatial features. We also design a Dual Attention Prompter to automatically obtain prompt information through Spatial Attention and Channel Attention. We conduct comprehensive experiments on four public abdominal tumor segmentation datasets. The results indicate that our DEAP-3DSAM achieves state-of-the-art performance in 3D image segmentation, outperforming or matching existing manual prompt methods. Furthermore, both quantitative and qualitative ablation studies confirm the effectiveness of our proposed modules.
comment: Accepted by BIBM 2024
☆ DynaMix: Generalizable Person Re-identification via Dynamic Relabeling and Mixed Data Sampling
Generalizable person re-identification (Re-ID) aims to recognize individuals across unseen cameras and environments. While existing methods rely heavily on limited labeled multi-camera data, we propose DynaMix, a novel method that effectively combines manually labeled multi-camera and large-scale pseudo-labeled single-camera data. Unlike prior works, DynaMix dynamically adapts to the structure and noise of the training data through three core components: (1) a Relabeling Module that refines pseudo-labels of single-camera identities on-the-fly; (2) an Efficient Centroids Module that maintains robust identity representations under a large identity space; and (3) a Data Sampling Module that carefully composes mixed data mini-batches to balance learning complexity and intra-batch diversity. All components are specifically designed to operate efficiently at scale, enabling effective training on millions of images and hundreds of thousands of identities. Extensive experiments demonstrate that DynaMix consistently outperforms state-of-the-art methods in generalizable person Re-ID.
☆ Understanding, Accelerating, and Improving MeanFlow Training
MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
☆ Granular Computing-driven SAM: From Coarse-to-Fine Guidance for Prompt-Free Segmentation
Prompt-free image segmentation aims to generate accurate masks without manual guidance. Typical pre-trained models, notably Segmentation Anything Model (SAM), generate prompts directly at a single granularity level. However, this approach has two limitations: (1) Localizability, lacking mechanisms for autonomous region localization; (2) Scalability, limited fine-grained modeling at high resolution. To address these challenges, we introduce Granular Computing-driven SAM (Grc-SAM), a coarse-to-fine framework motivated by Granular Computing (GrC). First, the coarse stage adaptively extracts high-response regions from features to achieve precise foreground localization and reduce reliance on external prompts. Second, the fine stage applies finer patch partitioning with sparse local swin-style attention to enhance detail modeling and enable high-resolution segmentation. Third, refined masks are encoded as latent prompt embeddings for the SAM decoder, replacing handcrafted prompts with an automated reasoning process. By integrating multi-granularity attention, Grc-SAM bridges granular computing with vision transformers. Extensive experimental results demonstrate Grc-SAM outperforms baseline methods in both accuracy and scalability. It offers a unique granular computational perspective for prompt-free segmentation.
comment: 19 pages, 7 figures
☆ LAA3D: A Benchmark of Detecting and Tracking Low-Altitude Aircraft in 3D Space
Perception of Low-Altitude Aircraft (LAA) in 3D space enables precise 3D object localization and behavior understanding. However, datasets tailored for 3D LAA perception remain scarce. To address this gap, we present LAA3D, a large-scale dataset designed to advance 3D detection and tracking of low-altitude aerial vehicles. LAA3D contains 15,000 real images and 600,000 synthetic frames, captured across diverse scenarios, including urban and suburban environments. It covers multiple aerial object categories, including electric Vertical Take-Off and Landing (eVTOL) aircraft, Micro Aerial Vehicles (MAVs), and Helicopters. Each instance is annotated with 3D bounding box, class label, and instance identity, supporting tasks such as 3D object detection, 3D multi-object tracking (MOT), and 6-DoF pose estimation. Besides, we establish the LAA3D Benchmark, integrating multiple tasks and methods with unified evaluation protocols for comparison. Furthermore, we propose MonoLAA, a monocular 3D detection baseline, achieving robust 3D localization from zoom cameras with varying focal lengths. Models pretrained on synthetic images transfer effectively to real-world data with fine-tuning, demonstrating strong sim-to-real generalization. Our LAA3D provides a comprehensive foundation for future research in low-altitude 3D object perception.
comment: 25 pages
☆ Beyond Reward Margin: Rethinking and Resolving Likelihood Displacement in Diffusion Models via Video Generation
Direct Preference Optimization (DPO) has shown promising results in aligning generative outputs with human preferences by distinguishing between chosen and rejected samples. However, a critical limitation of DPO is likelihood displacement, where the probabilities of chosen samples paradoxically decrease during training, undermining the quality of generation. Although this issue has been investigated in autoregressive models, its impact within diffusion-based models remains largely unexplored. This gap leads to suboptimal performance in tasks involving video generation. To address this, we conduct a formal analysis of DPO loss through updating policy within the diffusion framework, which describes how the updating of specific training samples influences the model's predictions on other samples. Using this tool, we identify two main failure modes: (1) Optimization Conflict, which arises from small reward margins between chosen and rejected samples, and (2) Suboptimal Maximization, caused by large reward margins. Informed by these insights, we introduce a novel solution named Policy-Guided DPO (PG-DPO), combining Adaptive Rejection Scaling (ARS) and Implicit Preference Regularization (IPR) to effectively mitigate likelihood displacement. Experiments show that PG-DPO outperforms existing methods in both quantitative metrics and qualitative evaluations, offering a robust solution for improving preference alignment in video generation tasks.
☆ MedSAM3: Delving into Segment Anything with Medical Concepts
Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical application. Here, we propose MedSAM-3, a text promptable medical segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, Ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.
☆ CSD: Change Semantic Detection with only Semantic Change Masks for Damage Assessment in Conflict Zones
Accurately and swiftly assessing damage from conflicts is crucial for humanitarian aid and regional stability. In conflict zones, damaged zones often share similar architectural styles, with damage typically covering small areas and exhibiting blurred boundaries. These characteristics lead to limited data, annotation difficulties, and significant recognition challenges, including high intra-class similarity and ambiguous semantic changes. To address these issues, we introduce a pre-trained DINOv3 model and propose a multi-scale cross-attention difference siamese network (MC-DiSNet). The powerful visual representation capability of the DINOv3 backbone enables robust and rich feature extraction from bi-temporal remote sensing images. We also release a new Gaza-change dataset containing high-resolution satellite image pairs from 2023-2024 with pixel-level semantic change annotations. It is worth emphasizing that our annotations only include semantic pixels of changed areas. Unlike conventional semantic change detection (SCD), our approach eliminates the need for large-scale semantic annotations of bi-temporal images, instead focusing directly on the changed regions. We term this new task change semantic detection (CSD). The CSD task represents a direct extension of binary change detection (BCD). Due to the limited spatial extent of semantic regions, it presents greater challenges than traditional SCD tasks. We evaluated our method under the CSD framework on both the Gaza-Change and SECOND datasets. Experimental results demonstrate that our proposed approach effectively addresses the CSD task, and its outstanding performance paves the way for practical applications in rapid damage assessment across conflict zones.
☆ ReEXplore: Improving MLLMs for Embodied Exploration with Contextualized Retrospective Experience Replay
Embodied exploration is a target-driven process that requires embodied agents to possess fine-grained perception and knowledge-enhanced decision making. While recent attempts leverage MLLMs for exploration due to their strong perceptual and reasoning abilities, we find that MLLM-based embodied agents remain suboptimal in exploring new environments: (i) they rely on profound but stale pre-trained knowledge, (ii) training-based approaches such as imitation learning or reinforcement learning are expensive for long-horizon tasks with sparse outcome rewards, and (iii) frontier-based exploration yields a large, visually nuanced action space that is difficult for MLLMs to make reliable decisions. We address these challenges with ReEXplore, a training-free framework that performs retrospective experience replay to inject distilled, abstract experience at inference time, and hierarchical frontier selection to decompose frontier ranking into coarse-to-fine decisions. Our approach enables robust, traceable, and efficient exploration. Across multiple embodied exploration benchmarks, ReEXplore yields great improvements over strong MLLM baselines, up to 3x higher performance in both success rate and in navigation efficiency under open-source backbones.
comment: 8 main pages plus 13 pages Appendix
☆ Benchmarking Corruption Robustness of LVLMs: A Discriminative Benchmark and Robustness Alignment Metric
Despite the remarkable reasoning abilities of large vision-language models (LVLMs), their robustness under visual corruptions remains insufficiently studied. Existing evaluation paradigms exhibit two major limitations: 1) the dominance of low-discriminative samples in current datasets masks the real robustness gap between models; and 2) conventional accuracy-based metric fail to capture the degradation of the underlying prediction structure. To bridge these gaps, we introduce Bench-C, a comprehensive benchmark emphasizing discriminative samples for assessing corruption robustness, where a selection strategy is proposed to jointly consider the prediction inconsistency under corruption and the semantic diversity. Furthermore, we propose the Robustness Alignment Score (RAS), a unified metric that measures degradation in logit-level prediction structure by considering the shifts in prediction uncertainty and calibration alignment. Comprehensive experiments and analysis reveal several interesting findings: 1) model behaviors exhibit distinguish patterns under corruptions, such as erroneous confidence and hesitation; 2) despite subtle corruption may lead to a slight accuracy gain, the overall prediction structure still degrades; 3) by decomposing corruption robustness into destructive and corrective components, the distinct failure and recovery patterns across models can be revealed.
comment: 15 pages
☆ Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling
Blind image quality assessment (BIQA) plays a crucial role in evaluating and optimizing visual experience. Most existing BIQA approaches fuse shallow and deep features extracted from backbone networks, while overlooking the unequal contributions to quality prediction. Moreover, while various vision encoder backbones are widely adopted in BIQA, the effective quality decoding architectures remain underexplored. To address these limitations, this paper investigates the contributions of shallow and deep features to BIQA, and proposes a effective quality feature decoding framework via GCN-enhanced \underline{l}ayer\underline{i}nteraction and MoE-based \underline{f}eature d\underline{e}coupling, termed \textbf{(Life-IQA)}. Specifically, the GCN-enhanced layer interaction module utilizes the GCN-enhanced deepest-layer features as query and the penultimate-layer features as key, value, then performs cross-attention to achieve feature interaction. Moreover, a MoE-based feature decoupling module is proposed to decouple fused representations though different experts specialized for specific distortion types or quality dimensions. Extensive experiments demonstrate that Life-IQA shows more favorable balance between accuracy and cost than a vanilla Transformer decoder and achieves state-of-the-art performance on multiple BIQA benchmarks.The code is available at: \href{https://github.com/TANGLONG2/Life-IQA/tree/main}{\texttt{Life-IQA}}.
☆ Dynamic Granularity Matters: Rethinking Vision Transformers Beyond Fixed Patch Splitting
Vision Transformers (ViTs) have demonstrated strong capabilities in capturing global dependencies but often struggle to efficiently represent fine-grained local details. Existing multi-scale approaches alleviate this issue by integrating hierarchical or hybrid features; however, they rely on fixed patch sizes and introduce redundant computation. To address these limitations, we propose Granularity-driven Vision Transformer (Grc-ViT), a dynamic coarse-to-fine framework that adaptively adjusts visual granularity based on image complexity. It comprises two key stages: (1) Coarse Granularity Evaluation module, which assesses visual complexity using edge density, entropy, and frequency-domain cues to estimate suitable patch and window sizes; (2) Fine-grained Refinement module, which refines attention computation according to the selected granularity, enabling efficient and precise feature learning. Two learnable parameters, α and \b{eta}, are optimized end-to-end to balance global reasoning and local perception. Comprehensive evaluations demonstrate that Grc-ViT enhances fine-grained discrimination while achieving a superior trade-off between accuracy and computational efficiency.
comment: 10 pages, 7 figures
☆ A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation
Text-to-LiDAR generation can customize 3D data with rich structures and diverse scenes for downstream tasks. However, the scarcity of Text-LiDAR pairs often causes insufficient training priors, generating overly smooth 3D scenes. Moreover, low-quality text descriptions may degrade generation quality and controllability. In this paper, we propose a Text-to-LiDAR Diffusion Model for scene generation, named T2LDM, with a Self-Conditioned Representation Guidance (SCRG). Specifically, SCRG, by aligning to the real representations, provides the soft supervision with reconstruction details for the Denoising Network (DN) in training, while decoupled in inference. In this way, T2LDM can perceive rich geometric structures from data distribution, generating detailed objects in scenes. Meanwhile, we construct a content-composable Text-LiDAR benchmark, T2nuScenes, along with a controllability metric. Based on this, we analyze the effects of different text prompts for LiDAR generation quality and controllability, providing practical prompt paradigms and insights. Furthermore, a directional position prior is designed to mitigate street distortion, further improving scene fidelity. Additionally, by learning a conditional encoder via frozen DN, T2LDM can support multiple conditional tasks, including Sparse-to-Dense, Dense-to-Sparse, and Semantic-to-LiDAR generation. Extensive experiments in unconditional and conditional generation demonstrate that T2LDM outperforms existing methods, achieving state-of-the-art scene generation.
☆ AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization WACV 2026
With the rapid advancement of sophisticated synthetic audio-visual content, e.g., for subtle malicious manipulations, ensuring the integrity of digital media has become paramount. This work presents a novel approach to temporal localization of deepfakes by leveraging Audio-Visual Speech Representation Reconstruction (AuViRe). Specifically, our approach reconstructs speech representations from one modality (e.g., lip movements) based on the other (e.g., audio waveform). Cross-modal reconstruction is significantly more challenging in manipulated video segments, leading to amplified discrepancies, thereby providing robust discriminative cues for precise temporal forgery localization. AuViRe outperforms the state of the art by +8.9 AP@0.95 on LAV-DF, +9.6 AP@0.5 on AV-Deepfake1M, and +5.1 AUC on an in-the-wild experiment. Code available at https://github.com/mever-team/auvire.
comment: WACV 2026
☆ View-Consistent Diffusion Representations for 3D-Consistent Video Generation
Video generation models have made significant progress in generating realistic content, enabling applications in simulation, gaming, and film making. However, current generated videos still contain visual artifacts arising from 3D inconsistencies, e.g., objects and structures deforming under changes in camera pose, which can undermine user experience and simulation fidelity. Motivated by recent findings on representation alignment for diffusion models, we hypothesize that improving the multi-view consistency of video diffusion representations will yield more 3D-consistent video generation. Through detailed analysis on multiple recent camera-controlled video diffusion models we reveal strong correlations between 3D-consistent representations and videos. We also propose ViCoDR, a new approach for improving the 3D consistency of video models by learning multi-view consistent diffusion representations. We evaluate ViCoDR on camera controlled image-to-video, text-to-video, and multi-view generation models, demonstrating significant improvements in the 3D consistency of the generated videos. Project page: https://danier97.github.io/ViCoDR.
☆ Rethinking Plant Disease Diagnosis: Bridging the Academic-Practical Gap with Vision Transformers and Zero-Shot Learning
Recent advances in deep learning have enabled significant progress in plant disease classification using leaf images. Much of the existing research in this field has relied on the PlantVillage dataset, which consists of well-centered plant images captured against uniform, uncluttered backgrounds. Although models trained on this dataset achieve high accuracy, they often fail to generalize to real-world field images, such as those submitted by farmers to plant diagnostic systems. This has created a significant gap between published studies and practical application requirements, highlighting the necessity of investigating and addressing this issue. In this study, we investigate whether attention-based architectures and zero-shot learning approaches can bridge the gap between curated academic datasets and real-world agricultural conditions in plant disease classification. We evaluate three model categories: Convolutional Neural Networks (CNNs), Vision Transformers, and Contrastive Language-Image Pre-training (CLIP)-based zero-shot models. While CNNs exhibit limited robustness under domain shift, Vision Transformers demonstrate stronger generalization by capturing global contextual features. Most notably, CLIP models classify diseases directly from natural language descriptions without any task-specific training, offering strong adaptability and interpretability. These findings highlight the potential of zero-shot learning as a practical and scalable domain adaptation strategy for plant health diagnosis in diverse field environments.
☆ UMCL: Unimodal-generated Multimodal Contrastive Learning for Cross-compression-rate Deepfake Detection
In deepfake detection, the varying degrees of compression employed by social media platforms pose significant challenges for model generalization and reliability. Although existing methods have progressed from single-modal to multimodal approaches, they face critical limitations: single-modal methods struggle with feature degradation under data compression in social media streaming, while multimodal approaches require expensive data collection and labeling and suffer from inconsistent modal quality or accessibility in real-world scenarios. To address these challenges, we propose a novel Unimodal-generated Multimodal Contrastive Learning (UMCL) framework for robust cross-compression-rate (CCR) deepfake detection. In the training stage, our approach transforms a single visual modality into three complementary features: compression-robust rPPG signals, temporal landmark dynamics, and semantic embeddings from pre-trained vision-language models. These features are explicitly aligned through an affinity-driven semantic alignment (ASA) strategy, which models inter-modal relationships through affinity matrices and optimizes their consistency through contrastive learning. Subsequently, our cross-quality similarity learning (CQSL) strategy enhances feature robustness across compression rates. Extensive experiments demonstrate that our method achieves superior performance across various compression rates and manipulation types, establishing a new benchmark for robust deepfake detection. Notably, our approach maintains high detection accuracy even when individual features degrade, while providing interpretable insights into feature relationships through explicit alignment.
comment: 24-page manuscript accepted to IJCV
☆ Zero-shot segmentation of skin tumors in whole-slide images with vision-language foundation models
Accurate annotation of cutaneous neoplasm biopsies represents a major challenge due to their wide morphological variability, overlapping histological patterns, and the subtle distinctions between benign and malignant lesions. Vision-language foundation models (VLMs), pre-trained on paired image-text corpora, learn joint representations that bridge visual features and diagnostic terminology, enabling zero-shot localization and classification of tissue regions without pixel-level labels. However, most existing VLM applications in histopathology remain limited to slide-level tasks or rely on coarse interactive prompts, and they struggle to produce fine-grained segmentations across gigapixel whole-slide images (WSIs). In this work, we introduce a zero-shot visual-language segmentation pipeline for whole-slide images (ZEUS), a fully automated, zero-shot segmentation framework that leverages class-specific textual prompt ensembles and frozen VLM encoders to generate high-resolution tumor masks in WSIs. By partitioning each WSI into overlapping patches, extracting visual embeddings, and computing cosine similarities against text prompts, we generate a final segmentation mask. We demonstrate competitive performance on two in-house datasets, primary spindle cell neoplasms and cutaneous metastases, highlighting the influence of prompt design, domain shifts, and institutional variability in VLMs for histopathology. ZEUS markedly reduces annotation burden while offering scalable, explainable tumor delineation for downstream diagnostic workflows.
comment: Conference manuscript accepted for oral presentation at CASEIB 2025
☆ Peregrine: One-Shot Fine-Tuning for FHE Inference of General Deep CNNs
We address two fundamental challenges in adapting general deep CNNs for FHE-based inference: approximating non-linear activations such as ReLU with low-degree polynomials while minimizing accuracy degradation, and overcoming the ciphertext capacity barrier that constrains high-resolution image processing on FHE inference. Our contributions are twofold: (1) a single-stage fine-tuning (SFT) strategy that directly converts pre-trained CNNs into FHE-friendly forms using low-degree polynomials, achieving competitive accuracy with minimal training overhead; and (2) a generalized interleaved packing (GIP) scheme that is compatible with feature maps of virtually arbitrary spatial resolutions, accompanied by a suite of carefully designed homomorphic operators that preserve the GIP-form encryption throughout computation. These advances enable efficient, end-to-end FHE inference across diverse CNN architectures. Experiments on CIFAR-10, ImageNet, and MS COCO demonstrate that the FHE-friendly CNNs obtained via our SFT strategy achieve accuracy comparable to baselines using ReLU or SiLU activations. Moreover, this work presents the first demonstration of FHE-based inference for YOLO architectures in object detection leveraging low-degree polynomial activations.
☆ CataractCompDetect: Intraoperative Complication Detection in Cataract Surgery
Cataract surgery is one of the most commonly performed surgeries worldwide, yet intraoperative complications such as iris prolapse, posterior capsule rupture (PCR), and vitreous loss remain major causes of adverse outcomes. Automated detection of such events could enable early warning systems and objective training feedback. In this work, we propose CataractCompDetect, a complication detection framework that combines phase-aware localization, SAM 2-based tracking, complication-specific risk scoring, and vision-language reasoning for final classification. To validate CataractCompDetect, we curate CataComp, the first cataract surgery video dataset annotated for intraoperative complications, comprising 53 surgeries, including 23 with clinical complications. On CataComp, CataractCompDetect achieves an average F1 score of 70.63%, with per-complication performance of 81.8% (Iris Prolapse), 60.87% (PCR), and 69.23% (Vitreous Loss). These results highlight the value of combining structured surgical priors with vision-language reasoning for recognizing rare but high-impact intraoperative events. Our dataset and code will be publicly released upon acceptance.
☆ AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). However, this history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage the context of history. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP that the action generation should be conditioned on the belief state. AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, which is a neural approximation of the agent's belief state derived from the previous decision step. Specifically, the AVA module uses the recurrent state to compute the soft weights to actively process task-relevant visual tokens based on its historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework's practical applicability and robust sim-to-real transferability.
comment: 18 pages, 10 figures
☆ Eevee: Towards Close-up High-resolution Video-based Virtual Try-on
Video virtual try-on technology provides a cost-effective solution for creating marketing videos in fashion e-commerce. However, its practical adoption is hindered by two critical limitations. First, the reliance on a single garment image as input in current virtual try-on datasets limits the accurate capture of realistic texture details. Second, most existing methods focus solely on generating full-shot virtual try-on videos, neglecting the business's demand for videos that also provide detailed close-ups. To address these challenges, we introduce a high-resolution dataset for video-based virtual try-on. This dataset offers two key features. First, it provides more detailed information on the garments, which includes high-fidelity images with detailed close-ups and textual descriptions; Second, it uniquely includes full-shot and close-up try-on videos of real human models. Furthermore, accurately assessing consistency becomes significantly more critical for the close-up videos, which demand high-fidelity preservation of garment details. To facilitate such fine-grained evaluation, we propose a new garment consistency metric VGID (Video Garment Inception Distance) that quantifies the preservation of both texture and structure. Our experiments validate these contributions. We demonstrate that by utilizing the detailed images from our dataset, existing video generation models can extract and incorporate texture features, significantly enhancing the realism and detail fidelity of virtual try-on results. Furthermore, we conduct a comprehensive benchmark of recent models. The benchmark effectively identifies the texture and structural preservation problems among current methods.
☆ Compressor-VLA: Instruction-Guided Visual Token Compression for Efficient Robotic Manipulation
Vision-Language-Action (VLA) models have emerged as a powerful paradigm in Embodied AI. However, the significant computational overhead of processing redundant visual tokens remains a critical bottleneck for real-time robotic deployment. While standard token pruning techniques can alleviate this, these task-agnostic methods struggle to preserve task-critical visual information. To address this challenge, simultaneously preserving both the holistic context and fine-grained details for precise action, we propose Compressor-VLA, a novel hybrid instruction-conditioned token compression framework designed for efficient, task-oriented compression of visual information in VLA models. The proposed Compressor-VLA framework consists of two token compression modules: a Semantic Task Compressor (STC) that distills holistic, task-relevant context, and a Spatial Refinement Compressor (SRC) that preserves fine-grained spatial details. This compression is dynamically modulated by the natural language instruction, allowing for the adaptive condensation of task-relevant visual information. Experimentally, extensive evaluations demonstrate that Compressor-VLA achieves a competitive success rate on the LIBERO benchmark while reducing FLOPs by 59% and the visual token count by over 3x compared to its baseline. The real-robot deployments on a dual-arm robot platform validate the model's sim-to-real transferability and practical applicability. Moreover, qualitative analyses reveal that our instruction guidance dynamically steers the model's perceptual focus toward task-relevant objects, thereby validating the effectiveness of our approach.
comment: 11 pages, 5 figures
☆ Leveraging Adversarial Learning for Pathological Fidelity in Virtual Staining
In addition to evaluating tumor morphology using H&E staining, immunohistochemistry is used to assess the presence of specific proteins within the tissue. However, this is a costly and labor-intensive technique, for which virtual staining, as an image-to-image translation task, offers a promising alternative. Although recent, this is an emerging field of research with 64% of published studies just in 2024. Most studies use publicly available datasets of H&E-IHC pairs from consecutive tissue sections. Recognizing the training challenges, many authors develop complex virtual staining models based on conditional Generative Adversarial Networks, but ignore the impact of adversarial loss on the quality of virtual staining. Furthermore, overlooking the issues of model evaluation, they claim improved performance based on metrics such as SSIM and PSNR, which are not sufficiently robust to evaluate the quality of virtually stained images. In this paper, we developed CSSP2P GAN, which we demonstrate to achieve heightened pathological fidelity through a blind pathological expert evaluation. Furthermore, while iteratively developing our model, we study the impact of the adversarial loss and demonstrate its crucial role in the quality of virtually stained images. Finally, while comparing our model with reference works in the field, we underscore the limitations of the currently used evaluation metrics and demonstrate the superior performance of CSSP2P GAN.
☆ VeCoR - Velocity Contrastive Regularization for Flow Matching
Flow Matching (FM) has recently emerged as a principled and efficient alternative to diffusion models. Standard FM encourages the learned velocity field to follow a target direction; however, it may accumulate errors along the trajectory and drive samples off the data manifold, leading to perceptual degradation, especially in lightweight or low-step configurations. To enhance stability and generalization, we extend FM into a balanced attract-repel scheme that provides explicit guidance on both "where to go" and "where not to go." To be formal, we propose \textbf{Velocity Contrastive Regularization (VeCoR)}, a complementary training scheme for flow-based generative modeling that augments the standard FM objective with contrastive, two-sided supervision. VeCoR not only aligns the predicted velocity with a stable reference direction (positive supervision) but also pushes it away from inconsistent, off-manifold directions (negative supervision). This contrastive formulation transforms FM from a purely attractive, one-sided objective into a two-sided training signal, regularizing trajectory evolution and improving perceptual fidelity across datasets and backbones. On ImageNet-1K 256$\times$256, VeCoR yields 22\% and 35\% relative FID reductions on SiT-XL/2 and REPA-SiT-XL/2 backbones, respectively, and achieves further FID gains (32\% relative) on MS-COCO text-to-image generation, demonstrating consistent improvements in stability, convergence, and image quality, particularly in low-step and lightweight settings. Project page: https://p458732.github.io/VeCoR_Project_Page/
☆ Human-Centric Open-Future Task Discovery: Formulation, Benchmark, and Scalable Tree-Based Search
Recent progress in robotics and embodied AI is largely driven by Large Multimodal Models (LMMs). However, a key challenge remains underexplored: how can we advance LMMs to discover tasks that directly assist humans in open-future scenarios, where human intentions are highly concurrent and dynamic. In this work, we formalize the problem of Human-centric Open-future Task Discovery (HOTD), focusing particularly on identifying tasks that reduce human effort across multiple plausible futures. To facilitate this study, we propose an HOTD-Bench, which features over 2K real-world videos, a semi-automated annotation pipeline, and a simulation-based protocol tailored for open-set future evaluation. Additionally, we propose the Collaborative Multi-Agent Search Tree (CMAST) framework, which decomposes the complex reasoning through a multi-agent system and structures the reasoning process through a scalable search tree module. In our experiments, CMAST achieves the best performance on the HOTD-Bench, significantly surpassing existing LMMs. It also integrates well with existing LMMs, consistently improving performance.
comment: 10 pages, 9 figures
☆ FineXtrol: Controllable Motion Generation via Fine-Grained Text AAAI 2026
Recent works have sought to enhance the controllability and precision of text-driven motion generation. Some approaches leverage large language models (LLMs) to produce more detailed texts, while others incorporate global 3D coordinate sequences as additional control signals. However, the former often introduces misaligned details and lacks explicit temporal cues, and the latter incurs significant computational cost when converting coordinates to standard motion representations. To address these issues, we propose FineXtrol, a novel control framework for efficient motion generation guided by temporally-aware, precise, user-friendly, and fine-grained textual control signals that describe specific body part movements over time. In support of this framework, we design a hierarchical contrastive learning module that encourages the text encoder to produce more discriminative embeddings for our novel control signals, thereby improving motion controllability. Quantitative results show that FineXtrol achieves strong performance in controllable motion generation, while qualitative analysis demonstrates its flexibility in directing specific body part movements.
comment: 20 pages, 14 figures, AAAI 2026
☆ AttenDence: Maximizing Attention Confidence for Test Time Adaptation
Test-time adaptation (TTA) enables models to adapt to distribution shifts at inference time. While entropy minimization over the output distribution has proven effective for TTA, transformers offer an additional unsupervised learning signal through their attention mechanisms. We propose minimizing the entropy of attention distributions from the CLS token to image patches as a novel TTA objective.This approach encourages the model to attend more confidently to relevant image regions under distribution shift and is effective even when only a single test image is available. We demonstrate that attention entropy minimization improves robustness across diverse corruption types while not hurting performance on clean data on a single sample stream of images at test time.
comment: Initial submission. 5 pages, 4 figures
☆ One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control
We present One4D, a unified framework for 4D generation and reconstruction that produces dynamic 4D content as synchronized RGB frames and pointmaps. By consistently handling varying sparsities of conditioning frames through a Unified Masked Conditioning (UMC) mechanism, One4D can seamlessly transition between 4D generation from a single image, 4D reconstruction from a full video, and mixed generation and reconstruction from sparse frames. Our framework adapts a powerful video generation model for joint RGB and pointmap generation, with carefully designed network architectures. The commonly used diffusion finetuning strategies for depthmap or pointmap reconstruction often fail on joint RGB and pointmap generation, quickly degrading the base video model. To address this challenge, we introduce Decoupled LoRA Control (DLC), which employs two modality-specific LoRA adapters to form decoupled computation branches for RGB frames and pointmaps, connected by lightweight, zero-initialized control links that gradually learn mutual pixel-level consistency. Trained on a mixture of synthetic and real 4D datasets under modest computational budgets, One4D produces high-quality RGB frames and accurate pointmaps across both generation and reconstruction tasks. This work represents a step toward general, high-quality geometry-based 4D world modeling using video diffusion models. Project page: https://mizhenxing.github.io/One4D
comment: Project page: https://mizhenxing.github.io/One4D
☆ BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models
Backdoor attacks undermine the reliability and trustworthiness of machine learning systems by injecting hidden behaviors that can be maliciously activated at inference time. While such threats have been extensively studied in unimodal settings, their impact on multimodal foundation models, particularly vision-language models (VLMs), remains largely underexplored. In this work, we introduce \textbf{BackdoorVLM}, the first comprehensive benchmark for systematically evaluating backdoor attacks on VLMs across a broad range of settings. It adopts a unified perspective that injects and analyzes backdoors across core vision-language tasks, including image captioning and visual question answering. BackdoorVLM organizes multimodal backdoor threats into 5 representative categories: targeted refusal, malicious injection, jailbreak, concept substitution, and perceptual hijack. Each category captures a distinct pathway through which an adversary can manipulate a model's behavior. We evaluate these threats using 12 representative attack methods spanning text, image, and bimodal triggers, tested on 2 open-source VLMs and 3 multimodal datasets. Our analysis reveals that VLMs exhibit strong sensitivity to textual instructions, and in bimodal backdoors the text trigger typically overwhelms the image trigger when forming the backdoor mapping. Notably, backdoors involving the textual modality remain highly potent, with poisoning rates as low as 1\% yielding over 90\% success across most tasks. These findings highlight significant, previously underexplored vulnerabilities in current VLMs. We hope that BackdoorVLM can serve as a useful benchmark for analyzing and mitigating multimodal backdoor threats. Code is available at: https://github.com/bin015/BackdoorVLM .
☆ EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models
Video large language models have demonstrated strong video understanding capabilities but suffer from high inference costs due to the massive number of tokens in long videos. Inspired by event-based vision, we propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction. From a holistic spatio-temporal perspective, we further integrate question relevance from keyframe sampling to adaptively allocate token pruning budgets. To facilitate evaluation, we construct EventBench, the first event-inclusive, human-annotated multimodal benchmark that covers diverse real-world scenarios. Beyond physical event cameras, EventSTU also supports general video understanding using simulated events. Comprehensive experiments show that EventSTU achieves 3.01x FLOPs reduction and 3.10x prefilling speedup over the strongest baseline while still improving performance.
comment: 8 pages, 7 figures
☆ Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation
Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of textual visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many to many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit noisy ones. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: inter-group Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores. Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.
☆ MatMart: Material Reconstruction of 3D Objects via Diffusion
Applying diffusion models to physically-based material estimation and generation has recently gained prominence. In this paper, we propose \ttt, a novel material reconstruction framework for 3D objects, offering the following advantages. First, \ttt\ adopts a two-stage reconstruction, starting with accurate material prediction from inputs and followed by prior-guided material generation for unobserved views, yielding high-fidelity results. Second, by utilizing progressive inference alongside the proposed view-material cross-attention (VMCA), \ttt\ enables reconstruction from an arbitrary number of input images, demonstrating strong scalability and flexibility. Finally, \ttt\ achieves both material prediction and generation capabilities through end-to-end optimization of a single diffusion model, without relying on additional pre-trained models, thereby exhibiting enhanced stability across various types of objects. Extensive experiments demonstrate that \ttt\ achieves superior performance in material reconstruction compared to existing methods.
☆ MetaDCSeg: Robust Medical Image Segmentation via Meta Dynamic Center Weighting
Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomical boundaries, which lead to instability in model training. Existing methods typically rely on global noise assumptions or confidence-based sample selection, which inadequately mitigate the performance degradation caused by annotation noise, especially in challenging boundary regions. To address this issue, we propose MetaDCSeg, a robust framework that dynamically learns optimal pixel-wise weights to suppress the influence of noisy ground-truth labels while preserving reliable annotations. By explicitly modeling boundary uncertainty through a Dynamic Center Distance (DCD) mechanism, our approach utilizes weighted feature distances for foreground, background, and boundary centers, directing the model's attention toward hard-to-segment pixels near ambiguous boundaries. This strategy enables more precise handling of structural boundaries, which are often overlooked by existing methods, and significantly enhances segmentation performance. Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg consistently outperforms existing state-of-the-art methods.
☆ MFmamba: A Multi-function Network for Panchromatic Image Resolution Restoration Based on State-Space Model AAAI-2026
Remote sensing images are becoming increasingly widespread in military, earth resource exploration. Because of the limitation of a single sensor, we can obtain high spatial resolution grayscale panchromatic (PAN) images and low spatial resolution color multispectral (MS) images. Therefore, an important issue is to obtain a color image with high spatial resolution when there is only a PAN image at the input. The existing methods improve spatial resolution using super-resolution (SR) technology and spectral recovery using colorization technology. However, the SR technique cannot improve the spectral resolution, and the colorization technique cannot improve the spatial resolution. Moreover, the pansharpening method needs two registered inputs and can not achieve SR. As a result, an integrated approach is expected. To solve the above problems, we designed a novel multi-function model (MFmamba) to realize the tasks of SR, spectral recovery, joint SR and spectral recovery through three different inputs. Firstly, MFmamba utilizes UNet++ as the backbone, and a Mamba Upsample Block (MUB) is combined with UNet++. Secondly, a Dual Pool Attention (DPA) is designed to replace the skip connection in UNet++. Finally, a Multi-scale Hybrid Cross Block (MHCB) is proposed for initial feature extraction. Many experiments show that MFmamba is competitive in evaluation metrics and visual results and performs well in the three tasks when only the input PAN image is used.
comment: 9 pages, 9 figures. This paper has been accepted for publication in AAAI-2026
☆ MagicWorld: Interactive Geometry-driven Video World Exploration
Recent interactive video world model methods generate scene evolution conditioned on user instructions. Although they achieve impressive results, two key limitations remain. First, they fail to fully exploit the correspondence between instruction-driven scene motion and the underlying 3D geometry, which results in structural instability under viewpoint changes. Second, they easily forget historical information during multi-step interaction, resulting in error accumulation and progressive drift in scene semantics and structure. To address these issues, we propose MagicWorld, an interactive video world model that integrates 3D geometric priors and historical retrieval. MagicWorld starts from a single scene image, employs user actions to drive dynamic scene evolution, and autoregressively synthesizes continuous scenes. We introduce the Action-Guided 3D Geometry Module (AG3D), which constructs a point cloud from the first frame of each interaction and the corresponding action, providing explicit geometric constraints for viewpoint transitions and thereby improving structural consistency. We further propose History Cache Retrieval (HCR) mechanism, which retrieves relevant historical frames during generation and injects them as conditioning signals, helping the model utilize past scene information and mitigate error accumulation. Experimental results demonstrate that MagicWorld achieves notable improvements in scene stability and continuity across interaction iterations.
☆ Facade Segmentation for Solar Photovoltaic Suitability NeurIPS 2025
Building integrated photovoltaic (BIPV) facades represent a promising pathway towards urban decarbonization, especially where roof areas are insufficient and ground-mounted arrays are infeasible. Although machine learning-based approaches to support photovoltaic (PV) planning on rooftops are well researched, automated approaches for facades still remain scarce and oversimplified. This paper therefore presents a pipeline that integrates detailed information on the architectural composition of the facade to automatically identify suitable surfaces for PV application and estimate the solar energy potential. The pipeline fine-tunes SegFormer-B5 on the CMP Facades dataset and converts semantic predictions into facade-level PV suitability masks and PV panel layouts considering module sizes and clearances. Applied to a dataset of 373 facades with known dimensions from ten cities, the results show that installable BIPV potential is significantly lower than theoretical potential, thus providing valuable insights for reliable urban energy planning. With the growing availability of facade imagery, the proposed pipeline can be scaled to support BIPV planning in cities worldwide.
comment: NeurIPS 2025 Tackling Climate Change with Machine Learning Workshop version. Non-archival
☆ Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference
Multimodal large language models (MLLMs) deliver impressive vision-language reasoning but suffer steep inference latency because self-attention scales quadratically with sequence length and thousands of visual tokens contributed by high-resolution images. Naively pruning less-informative visual tokens reduces this burden, yet indiscriminate removal can strip away contextual cues essential for background or fine-grained questions, undermining accuracy. In this paper, we present ParVTS (Parallel Vision Token Scheduling), a training-free scheduling framework that partitions visual tokens into subject and non-subject groups, processes them in parallel to transfer their semantics into question tokens, and discards the non-subject path mid-inference to reduce computation. This scheduling reduces computational complexity, requires no heuristics or additional modules, and is compatible with diverse existing MLLM architectures. Experiments across multiple MLLM backbones show that ParVTS prunes up to 88.9% of visual tokens with minimal performance drop, achieving 1.77x speedup and 70% FLOPs reduction.
☆ GContextFormer: A global context-aware hybrid multi-head attention approach with scaled additive aggregation for multimodal trajectory prediction
Multimodal trajectory prediction generates multiple plausible future trajectories to address vehicle motion uncertainty from intention ambiguity and execution variability. However, HD map-dependent models suffer from costly data acquisition, delayed updates, and vulnerability to corrupted inputs, causing prediction failures. Map-free approaches lack global context, with pairwise attention over-amplifying straight patterns while suppressing transitional patterns, resulting in motion-intention misalignment. This paper proposes GContextFormer, a plug-and-play encoder-decoder architecture with global context-aware hybrid attention and scaled additive aggregation achieving intention-aligned multimodal prediction without map reliance. The Motion-Aware Encoder builds scene-level intention prior via bounded scaled additive aggregation over mode-embedded trajectory tokens and refines per-mode representations under shared global context, mitigating inter-mode suppression and promoting intention alignment. The Hierarchical Interaction Decoder decomposes social reasoning into dual-pathway cross-attention: a standard pathway ensures uniform geometric coverage over agent-mode pairs while a neighbor-context-enhanced pathway emphasizes salient interactions, with gating module mediating their contributions to maintain coverage-focus balance. Experiments on eight highway-ramp scenarios from TOD-VT dataset show GContextFormer outperforms state-of-the-art baselines. Compared to existing transformer models, GContextFormer achieves greater robustness and concentrated improvements in high-curvature and transition zones via spatial distributions. Interpretability is achieved through motion mode distinctions and neighbor context modulation exposing reasoning attribution. The modular architecture supports extensibility toward cross-domain multimodal reasoning tasks. Source: https://fenghy-chen.github.io/sources/.
☆ Neural Texture Splatting: Expressive 3D Gaussian Splatting for View Synthesis, Geometry, and Dynamic Reconstruction SIGGRAPH
3D Gaussian Splatting (3DGS) has emerged as a leading approach for high-quality novel view synthesis, with numerous variants extending its applicability to a broad spectrum of 3D and 4D scene reconstruction tasks. Despite its success, the representational capacity of 3DGS remains limited by the use of 3D Gaussian kernels to model local variations. Recent works have proposed to augment 3DGS with additional per-primitive capacity, such as per-splat textures, to enhance its expressiveness. However, these per-splat texture approaches primarily target dense novel view synthesis with a reduced number of Gaussian primitives, and their effectiveness tends to diminish when applied to more general reconstruction scenarios. In this paper, we aim to achieve concrete performance improvement over state-of-the-art 3DGS variants across a wide range of reconstruction tasks, including novel view synthesis, geometry and dynamic reconstruction, under both sparse and dense input settings. To this end, we introduce Neural Texture Splatting (NTS). At the core of our approach is a global neural field (represented as a hybrid of a tri-plane and a neural decoder) that predicts local appearance and geometric fields for each primitive. By leveraging this shared global representation that models local texture fields across primitives, we significantly reduce model size and facilitate efficient global information exchange, demonstrating strong generalization across tasks. Furthermore, our neural modeling of local texture fields introduces expressive view- and time-dependent effects, a critical aspect that existing methods fail to account for. Extensive experiments show that Neural Texture Splatting consistently improves models and achieves state-of-the-art results across multiple benchmarks.
comment: SIGGRAPH Asia 2025 (conference track), Project page: https://19reborn.github.io/nts/
☆ HunyuanVideo 1.5 Technical Report
We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions.Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.
☆ DualGazeNet: A Biologically Inspired Dual-Gaze Query Network for Salient Object Detection
Recent salient object detection (SOD) methods aim to improve performance in four key directions: semantic enhancement, boundary refinement, auxiliary task supervision, and multi-modal fusion. In pursuit of continuous gains, these approaches have evolved toward increasingly sophisticated architectures with multi-stage pipelines, specialized fusion modules, edge-guided learning, and elaborate attention mechanisms. However, this complexity paradoxically introduces feature redundancy and cross-component interference that obscure salient cues, ultimately reaching performance bottlenecks. In contrast, human vision achieves efficient salient object identification without such architectural complexity. This contrast raises a fundamental question: can we design a biologically grounded yet architecturally simple SOD framework that dispenses with most of this engineering complexity, while achieving state-of-the-art accuracy, computational efficiency, and interpretability? In this work, we answer this question affirmatively by introducing DualGazeNet, a biologically inspired pure Transformer framework that models the dual biological principles of robust representation learning and magnocellular-parvocellular dual-pathway processing with cortical attention modulation in the human visual system. Extensive experiments on five RGB SOD benchmarks show that DualGazeNet consistently surpasses 25 state-of-the-art CNN- and Transformer-based methods. On average, DualGazeNet achieves about 60\% higher inference speed and 53.4\% fewer FLOPs than four Transformer-based baselines of similar capacity (VST++, MDSAM, Sam2unet, and BiRefNet). Moreover, DualGazeNet exhibits strong cross-domain generalization, achieving leading or highly competitive performance on camouflaged and underwater SOD benchmarks without relying on additional modalities.
♻ ☆ SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation
We introduce SketchDeco, a training-free approach to sketch colourisation that bridges the gap between professional design needs and intuitive, region-based control. Our method empowers artists to use simple masks and colour palettes for precise spatial and chromatic specification, avoiding both the tediousness of manual assignment and the ambiguity of text-based prompts. We reformulate this task as a novel, training-free composition problem. Our core technical contribution is a guided latent-space blending process: we first leverage diffusion inversion to precisely ``paint'' user-defined colours into specified regions, and then use a custom self-attention mechanism to harmoniously blend these local edits with a globally consistent base image. This ensures both local colour fidelity and global harmony without requiring any model fine-tuning. Our system produces high-quality results in 15--20 inference steps on consumer GPUs, making professional-quality, controllable colourisation accessible.
comment: Project Page: \url{https://chaitron.github.io/SketchDeco/}
♻ ☆ The Geometry of Cortical Computation: Manifold Disentanglement and Predictive Dynamics in VCNet NeurIPS 2025
Despite their success, modern convolutional neural networks (CNNs) exhibit fundamental limitations, including data inefficiency, poor out-of-distribution generalization, and vulnerability to adversarial perturbations. These shortcomings can be traced to a lack of inductive biases that reflect the inherent geometric structure of the visual world. The primate visual system, in contrast, demonstrates superior efficiency and robustness, suggesting that its architectural and computational principles,which evolved to internalize these structures,may offer a blueprint for more capable artificial vision. This paper introduces Visual Cortex Network (VCNet), a novel neural network architecture whose design is informed by the macro-scale organization of the primate visual cortex. VCNet is framed as a geometric framework that emulates key biological mechanisms, including hierarchical processing across distinct cortical areas, dual-stream information segregation for learning disentangled representations, and top-down predictive feedback for representation refinement. We interpret these mechanisms through the lens of geometry and dynamical systems, positing that they guide the learning of structured, low-dimensional neural manifolds. We evaluate VCNet on two specialized benchmarks: the Spots-10 animal pattern dataset, which probes sensitivity to natural textures, and a light field image classification task, which requires processing higher-dimensional visual data. Our results show that VCNet achieves state-of-the-art accuracy of 92.1\% on Spots-10 and 74.4\% on the light field dataset, surpassing contemporary models of comparable size. This work demonstrates that integrating high-level neuroscientific principles, viewed through a geometric lens, can lead to more efficient and robust models, providing a promising direction for addressing long-standing challenges in machine learning.
comment: Published in the proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Symmetry and Geometry in Neural Representations (NeurReps). Additionally accepted for presentation in NeurIPS 2025 Workshop: Interpreting Cognition in Deep Learning Models (CogInterp)
♻ ☆ A Target-based Multi-LiDAR Multi-Camera Extrinsic Calibration System
Extrinsic Calibration represents the cornerstone of autonomous driving. Its accuracy plays a crucial role in the perception pipeline, as any errors can have implications for the safety of the vehicle. Modern sensor systems collect different types of data from the environment, making it harder to align the data. To this end, we propose a target-based extrinsic calibration system tailored for a multi-LiDAR and multi-camera sensor suite. This system enables cross-calibration between LiDARs and cameras with limited prior knowledge using a custom ChArUco board and a tailored nonlinear optimization method. We test the system with real-world data gathered in a warehouse. Results demonstrated the effectiveness of the proposed method, highlighting the feasibility of a unique pipeline tailored for various types of sensors.
comment: RiTA 2025 Accepted, 13 Pages, 6 Figures and 2 Tables
♻ ☆ The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification
Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity - leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multianimal tracking in the wild. The dataset is available at https://www.conservationxlabs.com/sa-fari.
♻ ☆ FOCUS: Efficient Keyframe Selection for Long Video Understanding
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs. Code is available at https://github.com/NUS-HPC-AI-Lab/FOCUS.
♻ ☆ InfoScale: Unleashing Training-free Variable-scaled Image Generation via Effective Utilization of Information
Diffusion models (DMs) have become dominant in visual generation but suffer performance drop when tested on resolutions that differ from the training scale, whether lower or higher. In fact, the key challenge in generating variable-scale images lies in the differing amounts of information across resolutions, which requires information conversion procedures to be varied for generating variable-scaled images. In this paper, we investigate the issues of three critical aspects in DMs for a unified analysis in variable-scaled generation: dilated convolution, attention mechanisms, and initial noise. Specifically, 1) dilated convolution in DMs for the higher-resolution generation loses high-frequency information. 2) Attention for variable-scaled image generation struggles to adjust the information aggregation adaptively. 3) The spatial distribution of information in the initial noise is misaligned with variable-scaled image. To solve the above problems, we propose \textbf{InfoScale}, an information-centric framework for variable-scaled image generation by effectively utilizing information from three aspects correspondingly. For information loss in 1), we introduce Progressive Frequency Compensation module to compensate for high-frequency information lost by dilated convolution in higher-resolution generation. For information aggregation inflexibility in 2), we introduce Adaptive Information Aggregation module to adaptively aggregate information in lower-resolution generation and achieve an effective balance between local and global information in higher-resolution generation. For information distribution misalignment in 3), we design Noise Adaptation module to re-distribute information in initial noise for variable-scaled generation. Our method is plug-and-play for DMs and extensive experiments demonstrate the effectiveness in variable-scaled image generation.
♻ ☆ VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
Prevailing joint prediction transformers for Video Highlight Detection and Moment Retrieval (HD/MR) exhibit deficiencies in handling cross-task dynamics, achieving robust video-text alignment, and utilizing effective attention mechanisms, with the potential of Large Language/Vision-Language Models (LLMs/LVLMs) being largely untapped. This paper introduces VideoLights, a novel HD/MR framework addressing these limitations by incorporating: (i) Convolutional Projection and Feature Refinement modules with an alignment loss for enhanced video-text feature congruity; (ii) a Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware representations; (iii) a Uni-directional joint-task feedback mechanism for synergistic task improvement; (iv) hard positive/negative losses for adaptive learning; and (v) the leveraging of LVLMs (e.g., BLIP-2) for superior multimodal feature integration and intelligent pre-training with synthetic data. Comprehensive evaluations on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that VideoLights significantly surpasses existing baselines, establishing new state-of-the-art performances. Codes and model checkpoints are available at https://github.com/dpaul06/VideoLights .
♻ ☆ Optimization-Free Style Transfer for 3D Gaussian Splats
The task of style transfer for 3D Gaussian splats has been explored in many previous works, but these require reconstructing or fine-tuning the splat while incorporating style information or optimizing a feature extraction network on the splat representation. We propose a reconstruction- and optimization-free approach to stylizing 3D Gaussian splats, allowing for direct stylization on a .ply or .splat file without requiring the original camera views. This is done by generating a graph structure across the implicit surface of the splat representation. A feed-forward, surface-based stylization method is then used and interpolated back to the individual splats in the scene. This also allows for fast stylization of splats with no additional training, achieving speeds under 2 minutes even on CPU-based consumer hardware. We demonstrate the quality results this approach achieves and compare to other 3D Gaussian splat style transfer methods. Code is publicly available at https://github.com/davidmhart/FastSplatStyler.
♻ ☆ Zero-Shot Coreset Selection via Iterative Subspace Sampling WACV 2026
Deep learning increasingly relies on massive data with substantial storage, annotation, and training costs. To reduce costs, coreset selection finds a representative subset of data to train models while ideally performing on par with the full data training. To maximize performance, current state-of-the-art coreset methods select data using dataset-specific ground truth labels and training. However, these methodological requirements prevent selection at scale on real-world, unlabeled data. To that end, this paper addresses the selection of coresets that achieve state-of-the-art performance but without using any labels or training on candidate data. Instead, our solution, Zero-Shot Coreset Selection via Iterative Subspace Sampling (ZCore), uses previously-trained foundation models to generate zero-shot, high-dimensional embedding spaces to interpret unlabeled data. ZCore then iteratively quantifies the relative value of all candidate data based on coverage and redundancy in numerous subspace distributions. Finally, ZCore selects a coreset sized for any data budget to train downstream models. We evaluate ZCore on four datasets and outperform several state-of-the-art label-based methods, especially at low data rates that provide the most substantial cost reduction. On ImageNet, ZCore selections for 10% training data achieve a downstream validation accuracy of 53.99%, which outperforms prior label-based methods and removes annotation and training costs for 1.15 million images. Our paper's code is publicly available at https://github.com/voxel51/zcore.
comment: WACV 2026
♻ ☆ Fairness in Multi-modal Medical Diagnosis with Demonstration Selection
Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.
comment: 10 pages (including 2 pages of references), 4 figures. This work explores fairness in multi-modal medical image reasoning using in-context learning
♻ ☆ Minimax Multi-Target Conformal Prediction with Applications to Imaging Inverse Problems
In ill-posed imaging inverse problems, uncertainty quantification remains a fundamental challenge, especially in safety-critical applications. Recently, conformal prediction has been used to quantify the uncertainty that the inverse problem contributes to downstream tasks like image classification, image quality assessment, fat mass quantification, etc. While existing works handle only a scalar estimation target, practical applications often involve multiple targets. In response, we propose an asymptotically minimax approach to multi-target conformal prediction that provides tight prediction intervals while ensuring joint marginal coverage. We then outline how our minimax approach can be applied to multi-metric blind image quality assessment, multi-task uncertainty quantification, and multi-round measurement acquisition. Finally, we numerically demonstrate the benefits of our minimax method, relative to existing multi-target conformal prediction methods, using both synthetic and magnetic resonance imaging (MRI) data. Code is available at https://github.com/jwen307/multi_target_minimax.
♻ ☆ Multiview point cloud registration with anisotropic and space-varying localization noise
In this paper, we address the problem of registering multiple point clouds corrupted with high anisotropic localization noise. Our approach follows the widely used framework of Gaussian mixture model (GMM) reconstruction with an expectation-maximization (EM) algorithm. Existing methods are based on an implicit assumption of space-invariant isotropic Gaussian noise. However, this assumption is violated in practice in applications such as single molecule localization microscopy (SMLM). To address this issue, we propose to introduce an explicit localization noise model that decouples shape modeling with the GMM from noise handling. We design a stochastic EM algorithm that considers noise-free data as a latent variable, with closed-form solutions at each EM step. The first advantage of our approach is to handle space-variant and anisotropic Gaussian noise with arbitrary covariances. The second advantage is to leverage the explicit noise model to impose prior knowledge about the noise that may be available from physical sensors. We show on various simulated data that our noise handling strategy improves significantly the robustness to high levels of anisotropic noise. We also demonstrate the performance of our method on real SMLM data.
♻ ☆ Automatic Multi-View X-Ray/CT Registration Using Bone Substructure Contours
Purpose: Accurate intraoperative X-ray/CT registration is essential for surgical navigation in orthopedic procedures. However, existing methods struggle with consistently achieving sub-millimeter accuracy, robustness under broad initial pose estimates or need manual key-point annotations. This work aims to address these challenges by proposing a novel multi-view X-ray/CT registration method for intraoperative bone registration. Methods: The proposed registration method consists of a multi-view, contour-based iterative closest point (ICP) optimization. Unlike previous methods, which attempt to match bone contours across the entire silhouette in both imaging modalities, we focus on matching specific subcategories of contours corresponding to bone substructures. This leads to reduced ambiguity in the ICP matches, resulting in a more robust and accurate registration solution. This approach requires only two X-ray images and operates fully automatically. Additionally, we contribute a dataset of 5 cadaveric specimens, including real X-ray images, X-ray image poses and the corresponding CT scans. Results: The proposed registration method is evaluated on real X-ray images using mean reprojection error (mRPD). The method consistently achieves sub-millimeter accuracy with a mRPD 0.67mm compared to 5.35mm by a commercial solution requiring manual intervention. Furthermore, the method offers improved practical applicability, being fully automatic. Conclusion: Our method offers a practical, accurate, and efficient solution for multi-view X-ray/CT registration in orthopedic surgeries, which can be easily combined with tracking systems. By improving registration accuracy and minimizing manual intervention, it enhances intraoperative navigation, contributing to more accurate and effective surgical outcomes in computer-assisted surgery (CAS).
comment: This paper was accepted to IPCAI 2025. The Project Webpage is: https://rflepp.github.io/BoneSubstructureContours2D3DRegistration/
♻ ☆ MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis in Chest X-Ray
Recent vision-language foundation models deliver state-of-the-art results in natural image classification, but falter in medical images due to pronounced domain shifts. Training a medical foundation model also requires substantial resources, including extensive annotated data and high computational capacity. To bridge this gap with minimal overhead, we introduce MedBridge, a lightweight multimodal adaptation framework that flexibly re-purposes arbitrary pre-trained foundation VLMs for medical image diagnosis. MedBridge comprises three novel core components. First, a Focal Sampling module that subsamples and extracts high-resolution local regions to capture subtle pathological features, compensating for the limited input resolution of foundation VLMs. Second, a Query-Encoder model with a small set of learnable queries to align the feature maps of frozen VLMs with medical semantics, without requiring retraining of the backbone layers. Third, a Mixture of Experts mechanism, driven by learnable queries, harnesses the complementary strength of various VLMs to maximize diagnostic performance. We evaluate MedBridge on five chest radiograph benchmarks in three key adaptation tasks, demonstrating its superior performance in both cross-domain and in-domain adaptation settings under varying levels of training data availability. MedBridge achieved an improvement of 6-15% in AUC compared to state-of-the-art VLM adaptation methods in multi-label thoracic disease diagnosis, underscoring its effectiveness in leveraging diverse foundation models for accurate and data-efficient medical diagnosis. Our project and code are available at https://github.com/ai-med/MedBridge.
♻ ☆ CUPID: Generative 3D Reconstruction via Joint Object and Pose Modeling
We introduce Cupid, a generative 3D reconstruction framework that jointly models the full distribution over both canonical objects and camera poses. Our two-stage flow-based model first generates a coarse 3D structure and 2D-3D correspondences to estimate the camera pose robustly. Conditioned on this pose, a refinement stage injects pixel-aligned image features directly into the generative process, marrying the rich prior of a generative model with the geometric fidelity of reconstruction. This strategy achieves exceptional faithfulness, outperforming state-of-the-art reconstruction methods by over 3 dB PSNR and 10% in Chamfer Distance. As a unified generative model that decouples the object and camera pose, Cupid naturally extends to multi-view and scene-level reconstruction tasks without requiring post-hoc optimization or fine-tuning.
comment: project page at https://cupid3d.github.io
♻ ☆ Don't Reach for the Stars: Rethinking Topology for Resilient Federated Learning
Federated learning (FL) enables collaborative model training across distributed clients while preserving data privacy by keeping data local. Traditional FL approaches rely on a centralized, star-shaped topology, where a central server aggregates model updates from clients. However, this architecture introduces several limitations, including a single point of failure, limited personalization, and poor robustness to distribution shifts or vulnerability to malfunctioning clients. Moreover, update selection in centralized FL often relies on low-level parameter differences, which can be unreliable when client data is not independent and identically distributed, and offer clients little control. In this work, we propose a decentralized, peer-to-peer (P2P) FL framework. It leverages the flexibility of the P2P topology to enable each client to identify and aggregate a personalized set of trustworthy and beneficial updates.This framework is the Local Inference Guided Aggregation for Heterogeneous Training Environments to Yield Enhancement Through Agreement and Regularization (LIGHTYEAR). Central to our method is an agreement score, computed on a local validation set, which quantifies the semantic alignment of incoming updates in the function space with respect to the clients reference model. Each client uses this score to select a tailored subset of updates and performs aggregation with a regularization term that further stabilizes the training. Our empirical evaluation across five datasets shows that the proposed approach consistently outperforms both, centralized baselines and existing P2P methods in terms of client-level performance, particularly under adversarial and heterogeneous conditions.
♻ ☆ Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions
The robustness of deep neural networks is a crucial factor in safety-critical applications, particularly in complex and dynamic environments (e.g., medical or driving scenarios) where localized corruptions can arise. While previous studies have evaluated the robustness of semantic segmentation (SS) models under whole-image natural or adversarial corruptions, a comprehensive investigation into the spatial robustness of dense vision models under localized corruptions remains underexplored. This paper fills this gap by introducing novel, region-aware metrics for benchmarking the spatial robustness of segmentation models, along with an evaluation framework to assess the impact of natural localized corruptions. Furthermore, it uncovers the inherent complexity of evaluating worst-case spatial robustness using only a single localized adversarial attack. To address this, the work proposes a region-aware multi-attack adversarial analysis to systematically assess model robustness across specific image regions. The proposed metrics and analysis were exploited to evaluate 14 segmentation models in driving scenarios, uncovering key insights into the effects of localized corruption in both natural and adversarial forms. The results reveal that models respond to these two types of threats differently; for instance, transformer-based segmentation models demonstrate notable robustness to localized natural corruptions but are highly vulnerable to adversarial ones, and vice versa for CNN-based models. Consequently, we also address the challenge of balancing robustness to both natural and adversarial localized corruptions by means of ensemble models, thereby achieving a broader threat coverage and improved reliability for dense vision tasks.
comment: Accepted for publication in Pattern Recognition
♻ ☆ In-Situ Tweedie Discrete Diffusion Models
While diffusion models excel at generating continuous data such as images, adapting them to discrete tasks has relied on indirect approaches that either operate in continuous embedding spaces or use token masking mechanisms, both of which deviate from modeling the true discrete data distribution that can be theoretically guaranteed by Tweedie's formula. We propose in-situ Tweedie Discrete Diffusion (TDD), a framework that performs diffusion guaranteed by Tweedie's formula directly within the discrete one-hot space, hence "in-situ." Unlike prior methods that diffuse continuous embeddings or mask tokens, TDD directly corrupts one-hot vectors with Gaussian noise and performs iterative denoising through a timestep-conditioned cross-entropy objective rather than mean-squared-error reconstruction. At each denoising step, the model predicts class probabilities, applies argmax to obtain discrete predictions, converts them to one-hot vectors, and feeds them into the next iteration with progressively reduced noise. This process naturally unifies discriminative classification and generative modeling under a single framework. Experiments demonstrate that TDD achieves strong performance on both image classification and text generation tasks, with extensive ablation studies confirming the effectiveness of each design component. Our work establishes a principled approach to discrete diffusion that preserves the core characteristics of diffusion models while operating natively in discrete space.
♻ ☆ ReefNet: A Large scale, Taxonomically Enriched Dataset and Benchmark for Hard Coral Classification
Coral reefs are rapidly declining due to anthropogenic pressures such as climate change, underscoring the urgent need for scalable, automated monitoring. We introduce ReefNet, a large public coral reef image dataset with point-label annotations mapped to the World Register of Marine Species (WoRMS). ReefNet aggregates imagery from 76 curated CoralNet sources and an additional site from Al Wajh in the Red Sea, totaling approximately 925000 genus-level hard coral annotations with expert-verified labels. Unlike prior datasets, which are often limited by size, geography, or coarse labels and are not ML-ready, ReefNet offers fine-grained, taxonomically mapped labels at a global scale to WoRMS. We propose two evaluation settings: (i) a within-source benchmark that partitions each source's images for localized evaluation, and (ii) a cross-source benchmark that withholds entire sources to test domain generalization. We analyze both supervised and zero-shot classification performance on ReefNet and find that while supervised within-source performance is promising, supervised performance drops sharply across domains, and performance is low across the board for zero-shot models, especially for rare and visually similar genera. This provides a challenging benchmark intended to catalyze advances in domain generalization and fine-grained coral classification. We will release our dataset, benchmarking code, and pretrained models to advance robust, domain-adaptive, global coral reef monitoring and conservation.
♻ ☆ U-REPA: Aligning Diffusion U-Nets to ViTs
Representation Alignment (REPA) that aligns Diffusion Transformer (DiT) hidden-states with ViT visual encoders has proven highly effective in DiT training, demonstrating superior convergence properties, but it has not been validated on the canonical diffusion U-Net architecture that shows faster convergence compared to DiTs. However, adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies emerge from U-Net's spatial downsampling operations; (3) space gaps between U-Net and ViT hinder the effectiveness of tokenwise alignment. To encounter these challenges, we propose \textbf{U-REPA}, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows: Firstly, we propose via observation that due to skip connection, the middle stage of U-Net is the best alignment option. Secondly, we propose upsampling of U-Net features after passing them through MLPs. Thirdly, we observe difficulty when performing tokenwise similarity alignment, and further introduces a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA could achieve excellent generation quality and greatly accelerates the convergence speed. With CFG guidance interval, U-REPA could reach $FID<1.5$ in 200 epochs or 1M iterations on ImageNet 256 $\times$ 256, and needs only half the total epochs to perform better than REPA under sd-vae-ft-ema. Codes: https://github.com/YuchuanTian/U-REPA
comment: 22 pages, 7 figures
♻ ☆ InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
Video inverse problems are fundamental to streaming, telepresence, and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers - leading to temporal artifacts - or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use. We introduce InstantViR, an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher's strong temporal modeling while completely removing iterative test-time optimization. The distillation is prior-driven: it only requires the teacher diffusion model and known degradation operators, and does not rely on externally paired clean/noisy video data. To further boost throughput, we replace the video-diffusion backbone VAE with a high-efficiency LeanVAE via an innovative teacher-space regularized distillation scheme, enabling low-latency latent-space processing. Across streaming random inpainting, Gaussian deblurring and super-resolution, InstantViR matches or surpasses the reconstruction quality of diffusion-based baselines while running at over 35 FPS on NVIDIA A100 GPUs, achieving up to 100 times speedups over iterative video diffusion solvers. These results show that diffusion-based video reconstruction is compatible with real-time, interactive, editable, streaming scenarios, turning high-quality video restoration into a practical component of modern vision systems.
♻ ☆ Splats in Splats: Robust and Effective 3D Steganography towards Gaussian Splatting AAAI 2026
3D Gaussian splatting (3DGS) has demonstrated impressive 3D reconstruction performance with explicit scene representations. Given the widespread application of 3DGS in 3D reconstruction and generation tasks, there is an urgent need to protect the copyright of 3DGS assets. However, existing copyright protection techniques for 3DGS overlook the usability of 3D assets, posing challenges for practical deployment. Here we describe splats in splats, the first 3DGS steganography framework that embeds 3D content in 3DGS itself without modifying any attributes. To achieve this, we take a deep insight into spherical harmonics (SH) and devise an importance-graded SH coefficient encryption strategy to embed the hidden SH coefficients. Furthermore, we employ a convolutional autoencoder to establish a mapping between the original Gaussian primitives' opacity and the hidden Gaussian primitives' opacity. Extensive experiments indicate that our method significantly outperforms existing 3D steganography techniques, with 5.31% higher scene fidelity and 3x faster rendering speed, while ensuring security, robustness, and user experience.
comment: Accepted by AAAI 2026
♻ ☆ HiGFA: Hierarchical Guidance for Fine-grained Data Augmentation with Diffusion Models
Generative diffusion models show promise for data augmentation. However, applying them to fine-grained tasks presents a significant challenge: ensuring synthetic images accurately capture the subtle, category-defining features critical for high fidelity. Standard approaches, such as text-based Classifier-Free Guidance (CFG), often lack the required specificity, potentially generating misleading examples that degrade fine-grained classifier performance. To address this, we propose Hierarchically Guided Fine-grained Augmentation (HiGFA). HiGFA leverages the temporal dynamics of the diffusion sampling process. It employs strong text and transformed contour guidance with fixed strengths in the early-to-mid sampling stages to establish overall scene, style, and structure. In the final sampling stages, HiGFA activates a specialized fine-grained classifier guidance and dynamically modulates the strength of all guidance signals based on prediction confidence. This hierarchical, confidence-driven orchestration enables HiGFA to generate diverse yet faithful synthetic images by intelligently balancing global structure formation with precise detail refinement. Experiments on several FGVC datasets demonstrate the effectiveness of HiGFA.
♻ ☆ TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection
Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a subspace combination guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport problem, where all visual tokens in an image are transported to textual subspaces based on semantic similarity. The transport constraints of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. A top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
♻ ☆ Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling
We present \textbf{Upsample Anything}, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only $\approx0.419 \text{s}$ per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling. \textbf{Project page:} \href{https://seominseok0429.github.io/Upsample-Anything/}{https://seominseok0429.github.io/Upsample-Anything/}
comment: 15 pages, 12 figures
♻ ☆ Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding
Vision-Language-Action (VLA) models have emerged as a promising framework for enabling generalist robots capable of perceiving, reasoning, and acting in the real world. These models usually build upon pretrained Vision-Language Models (VLMs), which excel at semantic understanding due to large-scale image and text pretraining. However, existing VLMs typically lack precise spatial understanding capabilities, as they are primarily tuned on 2D image-text pairs without 3D supervision. To address this limitation, recent approaches have incorporated explicit 3D inputs such as point clouds or depth maps, but this necessitates additional depth sensors or pre-trained depth estimation models, which may yield defective results. In contrast, our work introduces a plug-and-play module that implicitly incorporates 3D geometry features into VLA models by leveraging an off-the-shelf visual geometry foundation model. This integration provides the model with depth-aware visual representations, improving its ability to understand the geometric structure of the scene and the spatial relationships among objects from RGB images alone. We evaluate our method on a set of spatially challenging tasks in both simulation and the real world. Extensive evaluations show that our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.
♻ ☆ ConMamba: Contrastive Vision Mamba for Plant Disease Detection
Plant Disease Detection (PDD) is a key aspect of precision agriculture. However, existing deep learning methods often rely on extensively annotated datasets, which are time-consuming and costly to generate. Self-supervised Learning (SSL) offers a promising alternative by exploiting the abundance of unlabeled data. However, most existing SSL approaches suffer from high computational costs due to convolutional neural networks or transformer-based architectures. Additionally, they struggle to capture long-range dependencies in visual representation and rely on static loss functions that fail to align local and global features effectively. To address these challenges, we propose ConMamba, a novel SSL framework specially designed for PDD. ConMamba integrates the Vision Mamba Encoder (VME), which employs a bidirectional State Space Model (SSM) to capture long-range dependencies efficiently. Furthermore, we introduce a dual-level contrastive loss with dynamic weight adjustment to optimize local-global feature alignment. Experimental results on three benchmark datasets demonstrate that ConMamba significantly outperforms state-of-the-art methods across multiple evaluation metrics. This provides an efficient and robust solution for PDD.
♻ ☆ Teacher Encoder-Student Decoder Denoising Guided Segmentation Network for Anomaly Detection
Visual anomaly detection is a highly challenging task, often categorized as a one-class classification and segmentation problem. Recent studies have demonstrated that the student-teacher (S-T) framework effectively addresses this challenge. However, most S-T frameworks rely solely on pre-trained teacher networks to guide student networks in learning multi-scale similar features, overlooking the potential of the student networks to enhance learning through multi-scale feature fusion. In this study, we propose a novel model named PFADSeg, which integrates a pre-trained teacher network, a denoising student network with multi-scale feature fusion, and a guided anomaly segmentation network into a unified framework. By adopting a unique teacher-encoder and student-decoder denoising mode, the model improves the student network's ability to learn from teacher network features. Furthermore, an adaptive feature fusion mechanism is introduced to train a self-supervised segmentation network that synthesizes anomaly masks autonomously, significantly increasing detection performance. Rigorous evaluations on the widely-used MVTec AD dataset demonstrate that PFADSeg exhibits excellent performance, achieving an image-level AUC of 98.9%, a pixel-level mean precision of 76.4%, and an instance-level mean precision of 78.7%.
♻ ☆ Q-SAM2: Accurate Quantization for Segment Anything Model 2
The Segment Anything Model 2 (SAM2) is a powerful foundation model for promptable segmentation. However, its high computational and memory costs are a major barrier to deployment on resource-constrained devices. In this paper, we present Q-SAM2, an accurate low-bit quantization method that achieves high compression and high fidelity. To address performance degradation arising from challenging weight and activation distributions during quantization, Q-SAM2 introduces two novel contributions: Variance-Reduced Calibration (VRC), an initialization method that reduces weight statistical variance by minimizing the Frobenius norm over a small calibration batch; and Learnable Statistical Clipping (LSC), a Quantization-Aware Training (QAT) method that learns momentum-stabilized clipping factors to manage outliers in weights and activations. Comprehensive experiments demonstrate that Q-SAM2 achieves highly accurate inference with substantial efficiency gains, significantly surpassing state-of-the-art general QAT schemes, particularly in the ultra-low 2-bit regime. Specifically, Q-SAM2 achieves an accuracy gain of up to 9.7 ppt in J&F on the video segmentation benchmark and 7.3 ppt in mIoU for instance segmentation over the best competing QAT model, all while achieving an 8x reduction in model size compared to the BF16 baseline.
comment: 22 pages
♻ ☆ Beyond Complete Shapes: A Benchmark for Quantitative Evaluation of 3D Shape Surface Matching Algorithms
Finding correspondences between 3D deformable shapes is an important and long-standing problem in geometry processing, computer vision, graphics, and beyond. While various shape matching datasets exist, they are mostly static or limited in size, restricting their adaptation to different problem settings, including both full and partial shape matching. In particular the existing partial shape matching datasets are small (fewer than 100 shapes) and thus unsuitable for data-hungry machine learning approaches. Moreover, the type of partiality present in existing datasets is often artificial and far from realistic. To address these limitations, we introduce a generic and flexible framework for the procedural generation of challenging full and partial shape matching datasets. Our framework allows the propagation of custom annotations across shapes, making it useful for various applications. By utilising our framework and manually creating cross-dataset correspondences between seven existing (complete geometry) shape matching datasets, we propose a new large benchmark BeCoS with a total of 2543 shapes. Based on this, we offer several challenging benchmark settings, covering both full and partial matching, for which we evaluate respective state-of-the-art methods as baselines.
♻ ☆ From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images
Spatial transcriptomics (ST) measures gene expression at fine-grained spatial resolution, offering insights into tissue molecular landscapes. Previous methods for spatial gene expression prediction typically crop spots of interest from histopathology slide images, and train models to map each spot to a corresponding gene expression profile. However, these methods inherently lose the spatial resolution in gene expression: 1) each spot often contains multiple cells with distinct gene expression profiles; 2) spots are typically defined at fixed spatial resolutions, limiting the ability to predict gene expression at varying scales. To address these limitations, this paper presents PixNet, a dense prediction network capable of predicting spatially resolved gene expression across spots of varying sizes and scales directly from histopathology slide images. Different from previous methods that map individual spots to gene expression values, we generate a spatially dense continuous gene expression map from the histopathology slide image, and aggregate values within spots of interest to predict the gene expression. Our PixNet outperforms state-of-the-art methods on four common ST datasets in multiple spatial scales. The source code will be publicly available.
♻ ☆ Preventing Shortcut Learning in Medical Image Analysis through Intermediate Layer Knowledge Distillation from Specialist Teachers
Deep learning models are prone to learning shortcut solutions to problems using spuriously correlated yet irrelevant features of their training data. In high-risk applications such as medical image analysis, this phenomenon may prevent models from using clinically meaningful features when making predictions, potentially leading to poor robustness and harm to patients. We demonstrate that different types of shortcuts (those that are diffuse and spread throughout the image, as well as those that are localized to specific areas) manifest distinctly across network layers and can, therefore, be more effectively targeted through mitigation strategies that target the intermediate layers. We propose a novel knowledge distillation framework that leverages a teacher network fine-tuned on a small subset of task-relevant data to mitigate shortcut learning in a student network trained on a large dataset corrupted with a bias feature. Through extensive experiments on CheXpert, ISIC 2017, and SimBA datasets using various architectures (ResNet-18, AlexNet, DenseNet-121, and 3D CNNs), we demonstrate consistent improvements over traditional Empirical Risk Minimization, augmentation-based bias-mitigation, and group-based bias-mitigation approaches. In many cases, we achieve comparable performance with a baseline model trained on bias-free data, even on out-of-distribution test data. Our results demonstrate the practical applicability of our approach to real-world medical imaging scenarios where bias annotations are limited and shortcut features are difficult to identify a priori.
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:020
♻ ☆ Unsupervised and Source-Free Ranking of Biomedical Segmentation Models
Model transfer presents a solution to the challenges of segmentation in the biomedical community, where the immense cost of data annotation is a major bottleneck in the use of deep learning. At the same time, hundreds of models get trained on biomedical data, submitted to challenges, and posted in model zoos and repositories. A major hurdle to wider adoption of pre-trained models lies in the lack of methods for best model selection. While such methods have been proposed for classification models, semantic and instance segmentation model ranking remain largely unaddressed, especially in a practically important setting where no labels are available on the target dataset. Similarly, if unsupervised domain adaptation is used, practitioners are faced with the task of selecting the best adapted model without target domain labels. Building on previous work linking model generalisation and consistency under perturbation, we propose the first unsupervised and source-free transferability estimator for semantic and instance segmentation tasks. We evaluate on multiple segmentation problems across biomedical imaging, finding a strong correlation between the rankings based on our estimator and rankings based on target dataset performance.
comment: 24 pages, 6 figures
♻ ☆ OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution
Denoising Diffusion Probabilistic Models (DDPMs) show promising potential in one-step Real-World Image Super-Resolution (Real-ISR). Current one-step Real-ISR methods typically inject the low-quality (LQ) image latent representation at the start or end timestep of the DDPM scheduler. Recent studies have begun to note that the LQ image latent and the pre-trained noisy latent representations are intuitively closer at a mid-timestep. However, a quantitative analysis of these latent representations remains lacking. Considering these latent representations can be decomposed into signal and noise, we propose a method based on the Signal-to-Noise Ratio (SNR) to pre-compute an average optimal mid-timestep for injection. To better approximate the pre-trained noisy latent representation, we further introduce the Latent Representation Refinement (LRR) loss via a LoRA-enhanced VAE encoder. We also fine-tune the backbone of the DDPM-based generative model using LoRA to perform one-step denoising at the average optimal mid-timestep. Based on these components, we present OMGSR, a GAN-based Real-ISR framework that employs a DDPM-based generative model as the generator and a DINOv3-ConvNeXt model with multi-level discriminator heads as the discriminator. We also propose the DINOv3-ConvNeXt DISTS (Dv3CD) loss, which is enhanced for structural perception at varying resolutions. Within the OMGSR framework, we develop OMGSR-S based on SD2.1-base. An ablation study confirms that our pre-computation strategy and LRR loss significantly improve the baseline. Comparative studies demonstrate that OMGSR-S achieves state-of-the-art performance across multiple metrics. Code is available at \hyperlink{Github}{https://github.com/wuer5/OMGSR}.
♻ ☆ Restore Text First, Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance
Current image super-resolution methods show strong performance on natural images but distort text, creating a fundamental trade-off between image quality and textual readability. To address this, we introduce TIGER (Text-Image Guided supEr-Resolution), a novel two-stage framework that breaks this trade-off through a "text-first, image-later" paradigm. TIGER explicitly decouples glyph restoration from image enhancement: it first reconstructs precise text structures and uses them to guide full-image super-resolution. This ensures high fidelity and readability. To support comprehensive training and evaluation, we present the UZ-ST (UltraZoom-Scene Text) dataset, the first Chinese scene text dataset with extreme zoom. Extensive experiments show TIGER achieves state-of-the-art performance, enhancing readability and image quality.
♻ ☆ Motion-R1: Enhancing Motion Generation with Decomposed Chain-of-Thought and RL Binding
Text-to-Motion generation has become a fundamental task in human-machine interaction, enabling the synthesis of realistic human motions from natural language descriptions. Although recent advances in large language models and reinforcement learning have contributed to high-quality motion generation, two major challenges remain. Existing approaches often fail to capture the temporal and causal complexities inherent in natural language, leading to oversimplified or incoherent motions. Additionally, RL-based methods are frequently overly complex, hindering their scalability and adaptability across various motion generation tasks. To address these challenges, we propose Motion-R1, a novel framework that combines decomposed Chain-of-Thought reasoning with reinforcement learning to enhance both the quality and interpretability of generated motions. Specifically, we introduce the Decomposed CoT Data Engine, which leverages an automated pipeline to synthesize high-quality reasoning data, allowing the model to better capture the temporal dependencies and causal relationships of human motion. We also propose RL Binding, a reinforcement learning strategy that incorporates multi-modal text-motion alignment into the RL reward function, guiding the model to produce motions that are both semantically accurate and motionally realistic. Extensive experiments across benchmark datasets demonstrate that Motion-R1 achieves state-of-the-art performance, with a 3.5% improvement in MM-Dist on HumanML3D and improvements in R-Precision and FID on KIT-ML and BABEL, surpassing existing methods across key metrics and highlighting its superior capability in handling complex motion generation tasks. Project page: https://motion-r1.github.io/.
♻ ☆ Directed-CP: Directed Collaborative Perception for Connected and Autonomous Vehicles via Proactive Attention ICRA'25
Collaborative perception (CP) leverages visual data from connected and autonomous vehicles (CAV) to enhance an ego vehicle's field of view (FoV). Despite recent progress, current CP methods expand the ego vehicle's 360-degree perceptual range almost equally, which faces two key challenges. Firstly, in areas with uneven traffic distribution, focusing on directions with little traffic offers limited benefits. Secondly, under limited communication budgets, allocating excessive bandwidth to less critical directions lowers the perception accuracy in more vital areas. To address these issues, we propose Direct-CP, a proactive and direction-aware CP system aiming at improving CP in specific directions. Our key idea is to enable an ego vehicle to proactively signal its interested directions and readjust its attention to enhance local directional CP performance. To achieve this, we first propose an RSU-aided direction masking mechanism that assists an ego vehicle in identifying vital directions. Additionally, we design a direction-aware selective attention module to wisely aggregate pertinent features based on ego vehicle's directional priorities, communication budget, and the positional data of CAVs. Moreover, we introduce a direction-weighted detection loss (DWLoss) to capture the divergence between directional CP outcomes and the ground truth, facilitating effective model training. Extensive experiments on the V2X-Sim 2.0 dataset demonstrate that our approach achieves 19.8\% higher local perception accuracy in interested directions and 2.5\% higher overall perception accuracy than the state-of-the-art methods in collaborative 3D object detection tasks. Codes are available at https://github.com/yihangtao/Directed-CP.git.
comment: Accepted by ICRA'25
♻ ☆ Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification
This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object reidentification (Re-ID) across various supervision settings. Although prompt learning has enabled a recent work named CLIP-ReID to achieve promising performance, the underlying mechanisms and the necessity of prompt learning remain unclear due to the absence of semantic labels in ReID tasks. In this work, we first analyze the role prompt learning in CLIP-ReID and identify its limitations. Based on our investigations, we propose a simple yet effective approach to adapt CLIP for supervised object Re-ID. Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning. Experimental results on both person and vehicle Re-ID datasets demonstrate the competitiveness of our method compared to CLIP-ReID. Furthermore, we extend our PCL-based CLIP fine-tuning approach to unsupervised scenarios, where we achieve state-of-the art performance.
♻ ☆ K-FACE: A Large-Scale KIST Face Database in Consideration with Unconstrained Environments
In this paper, we introduce a new large-scale face database from KIST, denoted as K-FACE, and describe a novel capturing device specifically designed to obtain the data. The K-FACE database contains more than 1 million high-quality images of 1,000 subjects selected by considering the ratio of gender and age groups. It includes a variety of attributes, including 27 poses, 35 lighting conditions, three expressions, and occlusions by the combination of five types of accessories. As the K-FACE database is systematically constructed through a hemispherical capturing system with elaborate lighting control and multiple cameras, it is possible to accurately analyze the effects of factors that cause performance degradation, such as poses, lighting changes, and accessories. We consider not only the balance of external environmental factors, such as pose and lighting, but also the balance of personal characteristics such as gender and age group. The gender ratio is the same, while the age groups of subjects are uniformly distributed from the 20s to 50s for both genders. The K-FACE database can be extensively utilized in various vision tasks, such as face recognition, face frontalization, illumination normalization, face age estimation, and three-dimensional face model generation. We expect systematic diversity and uniformity of the K-FACE database to promote these research fields.
comment: 8 pages, 8 figures
♻ ☆ Roadside Monocular 3D Detection Prompted by 2D Detection WACV 2026
Roadside monocular 3D detection requires detecting objects of predefined classes in an RGB frame and predicting their 3D attributes, such as bird's-eye-view (BEV) locations. It has broad applications in traffic control, vehicle-vehicle communication, and vehicle-infrastructure cooperative perception. To address this task, we introduce Promptable 3D Detector (Pro3D), a novel detector design that leverages 2D detections as prompts. We build our Pro3D upon two key insights. First, compared to a typical 3D detector, a 2D detector is ``easier'' to train due to fewer loss terms and performs significantly better at localizing objects w.r.t 2D metrics. Second, once 2D detections precisely locate objects in the image, a 3D detector can focus on lifting these detections into 3D BEV, especially when fixed camera pose or scene geometry provide an informative prior. To encode and incorporate 2D detections, we explore three methods: (a) concatenating features from both 2D and 3D detectors, (b) attentively fusing 2D and 3D detector features, and (c) encoding properties of predicted 2D bounding boxes \{$x$, $y$, width, height, label\} and attentively fusing them with the 3D detector feature. Interestingly, the third method significantly outperforms the others, underscoring the effectiveness of 2D detections as prompts that offer precise object targets and allow the 3D detector to focus on lifting them into 3D. Pro3D is adaptable for use with a wide range of 2D and 3D detectors with minimal modifications. Comprehensive experiments demonstrate that our Pro3D significantly enhances existing methods, achieving state-of-the-art results on two contemporary benchmarks.
comment: Accepted by WACV 2026
♻ ☆ Monocular Person Localization under Camera Ego-motion IROS2025
Localizing a person from a moving monocular camera is critical for Human-Robot Interaction (HRI). To estimate the 3D human position from a 2D image, existing methods either depend on the geometric assumption of a fixed camera or use a position regression model trained on datasets containing little camera ego-motion. These methods are vulnerable to severe camera ego-motion, resulting in inaccurate person localization. We consider person localization as a part of a pose estimation problem. By representing a human with a four-point model, our method jointly estimates the 2D camera attitude and the person's 3D location through optimization. Evaluations on both public datasets and real robot experiments demonstrate our method outperforms baselines in person localization accuracy. Our method is further implemented into a person-following system and deployed on an agile quadruped robot.
comment: Accepted by IROS2025. Project page: https://medlartea.github.io/rpf-quadruped/
♻ ☆ COLI: A Hierarchical Efficient Compressor for Large Images
The escalating adoption of high-resolution, large-field-of-view imagery amplifies the need for efficient compression methodologies. Conventional techniques frequently fail to preserve critical image details, while data-driven approaches exhibit limited generalizability. Implicit Neural Representations (INRs) present a promising alternative by learning continuous mappings from spatial coordinates to pixel intensities for individual images, thereby storing network weights rather than raw pixels and avoiding the generalization problem. However, INR-based compression of large images faces challenges including slow compression speed and suboptimal compression ratios. To address these limitations, we introduce COLI (Compressor for Large Images), a novel framework leveraging Neural Representations for Videos (NeRV). First, recognizing that INR-based compression constitutes a training process, we accelerate its convergence through a pretraining-finetuning paradigm, mixed-precision training, and reformulation of the sequential loss into a parallelizable objective. Second, capitalizing on INRs' transformation of image storage constraints into weight storage, we implement Hyper-Compression, a novel post-training technique to substantially enhance compression ratios while maintaining minimal output distortion. Evaluations across two medical imaging datasets demonstrate that COLI consistently achieves competitive or superior PSNR and SSIM metrics at significantly reduced bits per pixel (bpp), while accelerating NeRV training by up to 4 times.
♻ ☆ Sim-DETR: Unlock DETR for Temporal Sentence Grounding ICCV 2025
Temporal sentence grounding aims to identify exact moments in a video that correspond to a given textual query, typically addressed with detection transformer (DETR) solutions. However, we find that typical strategies designed to enhance DETR do not improve, and may even degrade, its performance in this task. We systematically analyze and identify the root causes of this abnormal behavior: (1) conflicts between queries from similar target moments and (2) internal query conflicts due to the tension between global semantics and local localization. Building on these insights, we propose a simple yet powerful baseline, Sim-DETR, which extends the standard DETR with two minor modifications in the decoder layers: (1) constraining self-attention between queries based on their semantic and positional overlap and (2) adding query-to-frame alignment to bridge the global and local contexts. Experiments demonstrate that Sim-DETR unlocks the full potential of DETR for temporal sentence grounding, offering a strong baseline for future research.
comment: This work is accepted by ICCV 2025
♻ ☆ Benchmarking Endoscopic Surgical Image Restoration and Beyond
In endoscopic surgery, a clear and high-quality visual field is critical for surgeons to make accurate intraoperative decisions. However, persistent visual degradation, including smoke generated by energy devices, lens fogging from thermal gradients, and lens contamination due to blood or tissue fluid splashes during surgical procedures, severely impairs visual clarity. These degenerations can seriously hinder surgical workflow and pose risks to patient safety. To systematically investigate and address various forms of surgical scene degradation, we introduce a real- world open-source surgical image restoration dataset covering endoscopic environments, called SurgClean, which involves multi-type image restoration tasks from two medical sites, i.e., desmoking, defogging, and desplashing. SurgClean comprises 3,113 images with diverse degradation types and corresponding paired reference labels. Based on SurgClean, we establish a standardized evaluation benchmark and provide performance for 22 representative generic task-specific image restoration approaches, including 12 generic and 10 task-specific image restoration approaches. Experimental results reveal substantial performance gaps relative to clinical requirements, highlighting a critical opportunity for algorithm advancements in intelligent surgical restoration. Furthermore, we explore the degradation discrepancies between surgical and natural scenes from structural perception and semantic under- standing perspectives, providing fundamental insights for domain-specific image restoration research. Our work aims to empower restoration algorithms and improve the efficiency of clinical procedures.
♻ ☆ Prompt-guided Disentangled Representation for Action Recognition
Action recognition is a fundamental task in video understanding. Existing methods typically extract unified features to process all actions in one video, which makes it challenging to model the interactions between different objects in multi-action scenarios. To alleviate this issue, we explore disentangling any specified actions from complex scenes as an effective solution. In this paper, we propose Prompt-guided Disentangled Representation for Action Recognition (ProDA), a novel framework that disentangles any specified actions from a multi-action scene. ProDA leverages Spatio-temporal Scene Graphs (SSGs) and introduces Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations. Furthermore, we design a video-adapted GPNN that aggregates information using dynamic weights. Experiments in video action recognition demonstrate the effectiveness of our approach when compared with the state-of-the-art methods. Our code can be found in https://github.com/iamsnaping/ProDA.git
♻ ☆ PairHuman: A High-Fidelity Photographic Dataset for Customized Dual-Person Generation
Personalized dual-person portrait customization has considerable potential applications, such as preserving emotional memories and facilitating wedding photography planning. However, the absence of a benchmark dataset hinders the pursuit of high-quality customization in dual-person portrait generation. In this paper, we propose the PairHuman dataset, which is the first large-scale benchmark dataset specifically designed for generating dual-person portraits that meet high photographic standards. The PairHuman dataset contains more than 100K images that capture a variety of scenes, attire, and dual-person interactions, along with rich metadata, including detailed image descriptions, person localization, human keypoints, and attribute tags. We also introduce DHumanDiff, which is a baseline specifically crafted for dual-person portrait generation that features enhanced facial consistency and simultaneously balances in personalized person generation and semantic-driven scene creation. Finally, the experimental results demonstrate that our dataset and method produce highly customized portraits with superior visual quality that are tailored to human preferences. Our dataset is publicly available at https://github.com/annaoooo/PairHuman.
comment: 46 pages, 31 figures
♻ ☆ Sketch-1-to-3: One Single Sketch to 3D Detailed Face Reconstruction ACM MM
3D face reconstruction from a single sketch is a critical yet underexplored task with significant practical applications. The primary challenges stem from the substantial modality gap between 2D sketches and 3D facial structures, including: (1) accurately extracting facial keypoints from 2D sketches; (2) preserving diverse facial expressions and fine-grained texture details; and (3) training a high-performing model with limited data. In this paper, we propose Sketch-1-to-3, a novel framework for realistic 3D face reconstruction from a single sketch, to address these challenges. Specifically, we first introduce the Geometric Contour and Texture Detail (GCTD) module, which enhances the extraction of geometric contours and texture details from facial sketches. Additionally, we design a deep learning architecture with a domain adaptation module and a tailored loss function to align sketches with the 3D facial space, enabling high-fidelity expression and texture reconstruction. To facilitate evaluation and further research, we construct SketchFaces, a real hand-drawn facial sketch dataset, and Syn-SketchFaces, a synthetic facial sketch dataset. Extensive experiments demonstrate that Sketch-1-to-3 achieves state-of-the-art performance in sketch-based 3D face reconstruction.
comment: Accepted by ACM MMAsia 2025
Computers and Society
☆ Normative active inference: A numerical proof of principle for a computational and economic legal analytic approach to AI governance
This paper presents a computational account of how legal norms can influence the behavior of artificial intelligence (AI) agents, grounded in the active inference framework (AIF) that is informed by principles of economic legal analysis (ELA). The ensuing model aims to capture the complexity of human decision-making under legal constraints, offering a candidate mechanism for agent governance in AI systems, that is, the (auto)regulation of AI agents themselves rather than human actors in the AI industry. We propose that lawful and norm-sensitive AI behavior can be achieved through regulation by design, where agents are endowed with intentional control systems, or behavioral safety valves, that guide real-time decisions in accordance with normative expectations. To illustrate this, we simulate an autonomous driving scenario in which an AI agent must decide when to yield the right of way by balancing competing legal and pragmatic imperatives. The model formalizes how AIF can implement context-dependent preferences to resolve such conflicts, linking this mechanism to the conception of law as a scaffold for rational decision-making under uncertainty. We conclude by discussing how context-dependent preferences could function as safety mechanisms for autonomous agents, enhancing lawful alignment and risk mitigation in AI governance.
comment: 19 pages, 6 figures, 1 box
☆ Data Flows and Colonial Regimes in Africa: A Critical Analysis of the Colonial Futurities Embedded in AI Ecosystems
This chapter seeks to frame the elemental and invisible problems of AI and big data in the African context by examining digital sites and infrastructure through the lens of power and interests. It will present reflections on how these sites are using AI recommendation algorithms to recreate new digital societies in the region, how they have the potential to propagate algorithmic colonialism and negative gender norms, and what this means for the regional sustainable development agenda. The chapter proposes adopting business models that embrace response-ability and consider the existence of alternative socio-material worlds of AI. These reflections will mainly come from ongoing discussions with Kenyan social media users in this authors' user space talks, personal experiences and six months of active participant observations done by the authors.
comment: 12 pages
☆ Facilitating the Integration of LLMs Into Online Experiments With Simple Chat
As large language models (LLMs) become increasingly prevalent, understanding human-LLM interactions is emerging as a central priority in psychological research. Online experiments offer an efficient means to study human-LLM interactions, yet integrating LLMs into established survey platforms remains technically demanding, particularly when aiming for ecologically valid, real-time conversational experiences with strong experimental control. We introduce Simple Chat, an open-source, research-focused chat interface that streamlines LLM integration for platforms such as Qualtrics, oTree, and LimeSurvey, while presenting a unified participant experience across conditions. Simple Chat connects to both commercial providers and open-weights models, supports streaming responses to preserve conversational flow, and offers an administrative interface for fine-grained control of prompts and interface features. By reducing technical barriers, standardizing interfaces, and improving participant experience, Simple Chat helps advance the study of human-LLM interaction. In this article, we outline Simple Chat's key features, provide a step-by-step tutorial, and demonstrate its utility through two illustrative case studies.
☆ AI Consciousness and Existential Risk
In AI, the existential risk denotes the hypothetical threat posed by an artificial system that would possess both the capability and the objective, either directly or indirectly, to eradicate humanity. This issue is gaining prominence in scientific debate due to recent technical advancements and increased media coverage. In parallel, AI progress has sparked speculation and studies about the potential emergence of artificial consciousness. The two questions, AI consciousness and existential risk, are sometimes conflated, as if the former entailed the latter. Here, I explain that this view stems from a common confusion between consciousness and intelligence. Yet these two properties are empirically and theoretically distinct. Arguably, while intelligence is a direct predictor of an AI system's existential threat, consciousness is not. There are, however, certain incidental scenarios in which consciousness could influence existential risk, in either direction. Consciousness could be viewed as a means towards AI alignment, thereby lowering existential risk; or, it could be a precondition for reaching certain capabilities or levels of intelligence, and thus positively related to existential risk. Recognizing these distinctions can help AI safety researchers and public policymakers focus on the most pressing issues.
☆ LLM Chatbots in High School Programming: Exploring Behaviors and Interventions
This study uses a Design-Based Research (DBR) cycle to refine the integration of Large Language Models (LLMs) in high school programming education. The initial problem was identified in an Intervention Group where, in an unguided setting, a higher proportion of executive, solution-seeking queries correlated strongly and negatively with exam performance. A contemporaneous Comparison Group demonstrated that without guidance, these unproductive help-seeking patterns do not self-correct, with engagement fluctuating and eventually declining. This insight prompted a mid-course pedagogical intervention in the first group, designed to teach instrumental help-seeking. The subsequent evaluation confirmed the intervention's success, revealing a decrease in executive queries, as well as a shift toward more productive learning workflows. However, this behavioral change did not translate into a statistically significant improvement in exam grades, suggesting that altering tool-use strategies alone may be insufficient to overcome foundational knowledge gaps. The DBR process thus yields a more nuanced principle: the educational value of an LLM depends on a pedagogy that scaffolds help-seeking, but this is only one part of the complex process of learning.
☆ Regularity as Structural Amplifier, Not Trap: A Causal and Archetype-Based Analysis of Dropout in a Constrained Engineering Curriculum
Engineering programmes, particularly in Latin America, are often governed by rigid curricula and strict regularity rules that are claimed to create a Regularity Trap for capable students. This study tests that causal hypothesis using the CAPIRE framework, a leakage-aware pipeline that integrates curriculum topology and causal estimation. Using longitudinal data from 1,343 civil engineering students in Argentina, we formalize academic lag (accumulated friction) as a treatment and academic velocity as an ability proxy. A manual LinearDML estimator is employed to assess the average (ATE) and conditional (CATE) causal effects of lag on subsequent dropout, controlling for macro shocks (strikes, inflation). Results confirm that academic lag significantly increases dropout risk overall (ATE = 0.0167, p < 0.0001). However, the effect decreases sharply for high-velocity (high-ability) students, contradicting the universal Trap hypothesis. Archetype analysis (UMAP/DBSCAN) shows that friction disproportionately harms trajectories already characterized by high initial friction and unstable progression. 8 We conclude that regularity rules function as a Structural Amplifier of pre-existing vulnerability rather than a universal trap. This has direct implications for engineering curriculum design, demanding targeted slack allocation and intervention policies to reduce friction at core basic-cycle courses
comment: 28 pages, figures 9, tables 3
☆ Large Language Models Require Curated Context for Reliable Political Fact-Checking -- Even with Reasoning and Web Search
Large language models (LLMs) have raised hopes for automated end-to-end fact-checking, but prior studies report mixed results. As mainstream chatbots increasingly ship with reasoning capabilities and web search tools -- and millions of users already rely on them for verification -- rigorous evaluation is urgent. We evaluate 15 recent LLMs from OpenAI, Google, Meta, and DeepSeek on more than 6,000 claims fact-checked by PolitiFact, comparing standard models with reasoning- and web-search variants. Standard models perform poorly, reasoning offers minimal benefits, and web search provides only moderate gains, despite fact-checks being available on the web. In contrast, a curated RAG system using PolitiFact summaries improved macro F1 by 233% on average across model variants. These findings suggest that giving models access to curated high-quality context is a promising path for automated fact-checking.
☆ MAGMA-Edu: Multi-Agent Generative Multimodal Framework for Text-Diagram Educational Question Generation
Educational illustrations play a central role in communicating abstract concepts, yet current multimodal large language models (MLLMs) remain limited in producing pedagogically coherent and semantically consistent educational visuals. We introduce MAGMA-Edu, a self-reflective multi-agent framework that unifies textual reasoning and diagrammatic synthesis for structured educational problem generation. Unlike existing methods that treat text and image generation independently, MAGMA-Edu employs a two-stage co-evolutionary pipeline: (1) a generation-verification-reflection loop that iteratively refines question statements and solutions for mathematical accuracy, and (2) a code-based intermediate representation that enforces geometric fidelity and semantic alignment during image rendering. Both stages are guided by internal self-reflection modules that evaluate and revise outputs until domain-specific pedagogical constraints are met. Extensive experiments on multimodal educational benchmarks demonstrate the superiority of MAGMA-Edu over state-of-the-art MLLMs. Compared to GPT-4o, MAGMA-Edu improves the average textual metric from 57.01 to 92.31 (+35.3 pp) and boosts image-text consistency (ITC) from 13.20 to 85.24 (+72 pp). Across all model backbones, MAGMA-Edu achieves the highest scores (Avg-Text 96.20, ITC 99.12), establishing a new state of the art for multimodal educational content generation and demonstrating the effectiveness of self-reflective multi-agent collaboration in pedagogically aligned vision-language reasoning.
♻ ☆ The Loss of Control Playbook: Degrees, Dynamics, and Preparedness
This research report addresses the absence of an actionable definition for Loss of Control (LoC) in AI systems by developing a novel taxonomy and preparedness framework. Despite increasing policy and research attention, existing LoC definitions vary significantly in scope and timeline, hindering effective LoC assessment and mitigation. To address this issue, we draw from an extensive literature review and propose a graded LoC taxonomy, based on the metrics of severity and persistence, that distinguishes between Deviation, Bounded LoC, and Strict LoC. We model pathways toward a societal state of vulnerability in which sufficiently advanced AI systems have acquired or could acquire the means to cause Bounded or Strict LoC once a catalyst, either misalignment or pure malfunction, materializes. We argue that this state becomes increasingly likely over time, absent strategic intervention, and propose a strategy to avoid reaching a state of vulnerability. Rather than focusing solely on intervening on AI capabilities and propensities potentially relevant for LoC or on preventing potential catalysts, we introduce a complementary framework that emphasizes three extrinsic factors: Deployment context, Affordances, and Permissions (the DAP framework). Compared to work on intrinsic factors and catalysts, this framework has the unfair advantage of being actionable today. Finally, we put forward a plan to maintain preparedness and prevent the occurrence of LoC outcomes should a state of societal vulnerability be reached, focusing on governance measures (threat modeling, deployment policies, emergency response) and technical controls (pre-deployment testing, control measures, monitoring) that could maintain a condition of perennial suspension.
♻ ☆ Fairness in Multi-modal Medical Diagnosis with Demonstration Selection
Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.
comment: 10 pages (including 2 pages of references), 4 figures. This work explores fairness in multi-modal medical image reasoning using in-context learning
♻ ☆ Lost in translation: using global fact-checks to measure multilingual misinformation prevalence, spread, and evolution
Misinformation and disinformation are growing threats in the digital age, affecting people across languages and borders. However, no research has investigated the prevalence of multilingual misinformation and quantified the extent to which misinformation diffuses across languages. This paper investigates the prevalence and dynamics of multilingual misinformation through an analysis of 264,487 fact-checks spanning 95 languages. To study the evolution of claims over time and mutations across languages, we represent fact-checks with multilingual sentence embeddings and build a graph where semantically similar claims are linked. We provide quantitative evidence of repeated fact-checking efforts and establish that claims diffuse across languages. Specifically, we find that while the majority of misinformation claims are only fact-checked once, 10.26%, corresponding to more than 27,000 claims, are checked multiple times. Using fact-checks as a proxy for the spread of misinformation, we find 32.26% of repeated claims cross linguistic boundaries, suggesting that some misinformation permeates language barriers. However, spreading patterns exhibit strong assortativity, with misinformation more likely to spread within the same language or language family. Next we show that fact-checkers take more time to fact-check claims that have crossed language barriers and model the temporal and cross-lingual evolution of claims. We analyze connected components and shortest paths connecting different versions of a claim finding that claims gradually drift over time and undergo greater alteration when traversing languages. Misinformation changes over time, reducing the effectiveness of static claim matching algorithms. The findings advocate for expanded information sharing between fact-checkers globally while underscoring the importance of localized verification.
♻ ☆ Scaling Success: A Systematic Review of Peer Grading Strategies for Accuracy, Efficiency, and Learning in Contemporary Education
Peer grading has emerged as a scalable solution for assessment in large and online classrooms, offering both logistical efficiency and pedagogical value. However, designing effective peer-grading systems remains challenging due to persistent concerns around accuracy, fairness, reliability, and student engagement. This paper presents a systematic review of 122 peer-reviewed studies on peer grading spanning over four decades. Drawing from this literature, we propose a comprehensive taxonomy that organizes peer grading systems along two key dimensions: (1) evaluation approaches and (2) reviewer weighting strategies. We analyze how different design choices impact grading accuracy, fairness, student workload, and learning outcomes. Our findings highlight the strengths and limitations of each method. Notably, we found that formative feedback -- often regarded as the most valuable aspect of peer assessment -- is seldom incorporated as a quality-based weighting factor in summative grade synthesis techniques. Furthermore, no single reviewer weighting strategy proves universally optimal; each has its trade-offs. Hybrid strategies that combine multiple techniques could show the greatest promise. Our taxonomy offers a practical framework for educators and researchers aiming to design peer grading systems that are accurate, equitable, and pedagogically meaningful.
comment: Accepted for presentation at the Frontiers in Education Conference, Nashville, Tennessee, USA, 2-5 November 2025
♻ ☆ Easy-access online social media metrics are associated with misinformation sharing activity
Misinformation poses a significant challenge studied extensively by researchers, yet acquiring data to identify primary sharers is time-consuming and challenging. To address this, we propose a low-barrier approach to differentiate social media users who are more likely to share misinformation from those who are less likely. Leveraging insights from previous studies, we demonstrate that easy-access online social network metrics-average daily tweet count, and account age-can be leveraged to help identify potential low factuality content spreaders on X (previously known as Twitter). We find that higher tweet frequency is positively associated with low factuality in shared content, while account age is negatively associated with it. We also find that some of the effects differ depending on the number of accounts a user follows. Our findings show that relying on these easy-access social network metrics could serve as a low-barrier approach for initial identification of users who are more likely to spread misinformation, and therefore contribute to combating misinformation effectively on social media platforms.
♻ ☆ Mathematical Politics
Politics today is largely about the art of messaging to influence the public, but the mathematical theory of messaging -- information and communication theory -- can turn this art into a precise analysis, both qualitative and quantitative, that enables us to gain retrospective understandings of past political events and to make forward-looking future predictions.
comment: 17 pages, 4 figures, discussino on information transmission included
♻ ☆ Future-Back Threat Modeling: A Foresight-Driven Security Framework
Traditional threat modeling remains reactive-focused on known TTPs and past incident data, while threat prediction and forecasting frameworks are often disconnected from operational or architectural artifacts. This creates a fundamental weakness: the most serious cyber threats often do not arise from what is known, but from what is assumed, overlooked, or not yet conceived, and frequently originate from the future, such as artificial intelligence, information warfare, and supply chain attacks, where adversaries continuously develop new exploits that can bypass defenses built on current knowledge. To address this mental gap, this paper introduces the theory and methodology of Future-Back Threat Modeling (FBTM). This predictive approach begins with envisioned future threat states and works backward to identify assumptions, gaps, blind spots, and vulnerabilities in the current defense architecture, providing a clearer and more accurate view of impending threats so that we can anticipate their emergence and shape the future we want through actions taken now. The proposed methodology further aims to reveal known unknowns and unknown unknowns, including tactics, techniques, and procedures that are emerging, anticipated, and plausible. This enhances the predictability of adversary behavior, particularly under future uncertainty, helping security leaders make informed decisions today that shape more resilient security postures for the future.
♻ ☆ How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference
This paper introduces an infrastructure-aware benchmarking framework for quantifying the environmental footprint of LLM inference across 30 state-of-the-art models in commercial datacenters. The framework combines public API performance data with company-specific environmental multipliers and statistical inference of hardware configurations. We additionally utilize cross-efficiency Data Envelopment Analysis (DEA) to rank models by performance relative to environmental cost and provide a dynamically updated dashboard that visualizes model-level energy, water, and carbon metrics. Results show the most energy-intensive models exceed 29 Wh per long prompt, over 65 times the most efficient systems. Even a 0.42 Wh short query, when scaled to 700M queries/day, aggregates to annual electricity comparable to 35{,}000 U.S. homes, evaporative freshwater equal to the annual drinking needs of 1.2M people, and carbon emissions requiring a Chicago-sized forest to offset. These findings highlight a growing paradox: as AI becomes cheaper and faster, global adoption drives disproportionate resource consumption. Our methodology offers a standardized, empirically grounded basis for sustainability benchmarking and accountability in AI deployment.
Computation and Language
☆ Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting
This study systematically evaluated the mathematical reasoning capabilities of Large Language Models (LLMs) using the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section, ensuring a completely contamination-free evaluation environment. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam's public release, eliminating any possibility of inclusion in model training data. We conducted comprehensive evaluations of 24 state-of-the-art LLMs across varying input modalities (text, image, text+figure) and prompt languages (Korean, English). GPT-5 Codex achieved the only perfect score (100 points) with text input and Korean prompts, while Grok 4, GPT-5, and Deepseek R1 scored above 95 points. Notably, gpt-oss-20B achieved 95.7 points despite its relatively small size, demonstrating high cost-effectiveness. Problem-specific analysis revealed geometry as the weakest domain (77.7% average) with significant performance degradation on 4-point high-difficulty problems. Text input consistently outperformed image input, while prompt language effects varied by model scale. In reasoning enhancement experiments with GPT-5 series, increased reasoning intensity improved performance (from 82.6 to 100 points) but quadrupled token usage and drastically reduced efficiency, suggesting that models with minimal reasoning may be more practical. This research contributes: (1) implementation of a completely unexposed evaluation environment, (2) a real-exam-based LLM assessment framework, and (3) a practical evaluation perspective integrating performance, cost, and time considerations. Detailed results and model comparisons are available at the 2026 Korean CSAT LLM Evaluation Leaderboard (https://isoft.cnu.ac.kr/csat2026/).
comment: 52 pages, Korean
☆ No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases
Large Language Models (LLMs) inherit societal biases from their training data, potentially leading to harmful or unfair outputs. While various techniques aim to mitigate these biases, their effects are often evaluated only along the dimension of the bias being targeted. This work investigates the cross-category consequences of targeted bias mitigation. We study four bias mitigation techniques applied across ten models from seven model families, and we explore racial, religious, profession- and gender-related biases. We measure the impact of debiasing on model coherence and stereotypical preference using the StereoSet benchmark. Our results consistently show that while targeted mitigation can sometimes reduce bias in the intended dimension, it frequently leads to unintended and often negative consequences in others, such as increasing model bias and decreasing general coherence. These findings underscore the critical need for robust, multi-dimensional evaluation tools when examining and developing bias mitigation strategies to avoid inadvertently shifting or worsening bias along untargeted axes.
☆ Majority of the Bests: Improving Best-of-N via Bootstrapping
Sampling multiple outputs from a Large Language Model (LLM) and selecting the most frequent (Self-consistency) or highest-scoring (Best-of-N) candidate is a popular approach to achieve higher accuracy in tasks with discrete final answers. Best-of-N (BoN) selects the output with the highest reward, and with perfect rewards, it often achieves near-perfect accuracy. With imperfect rewards from reward models, however, BoN fails to reliably find the correct answer and its performance degrades drastically. We consider the distribution of BoN's outputs and highlight that, although the correct answer does not usually have a probability close to one under imperfect rewards, it is often the most likely outcome. This suggests that the mode of this distribution can be more reliably correct than a sample from it. Based on this idea, we propose Majority-of-the-Bests (MoB), a novel selection mechanism that estimates the output distribution of BoN via bootstrapping and selects its mode. Experimental results across five benchmarks, three different base LLMs, and two reward models demonstrate consistent improvements over BoN in 25 out of 30 setups. We also provide theoretical results for the consistency of the bootstrapping. MoB serves as a simple, yet strong alternative to BoN and self-consistency, and more broadly, motivates further research in more nuanced selection mechanisms.
☆ OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph
We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. These lexemes include 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content. Generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, the entire resource was produced in under one week for under $1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content -- definitions, examples, collocations, encyclopedias, etymology -- that supports both vocabulary learning and natural language processing tasks. As a synthetically generated resource, OpenGloss reflects both the capabilities and limitations of current foundation models. The dataset is publicly available on Hugging Face under CC-BY 4.0, enabling researchers and educators to build upon and adapt this resource.
comment: 30 pages, 5 figures, 8 tables. Dataset available at https://huggingface.co/datasets/mjbommar/opengloss-dictionary
Prompt Optimization as a State-Space Search Problem
Language Models are extremely susceptible to performance collapse with even small changes to input prompt strings. Libraries such as DSpy (from Stanford NLP) avoid this problem through demonstration-based prompt optimisation. Inspired by this, I propose an alternative approach that treats prompt optimisation as a classical state-space search problem. I model the prompt space as a graph where nodes represent prompt states and edges correspond to deliberate transformations such as shortening, adding examples, or re- ordering content. Using beam search and random walk algorithms, I systematically explore this space, evaluating candidates on development sets and pruning unpromising branches. Across five NLP tasks (sentiment classification, question answering, summarisation, reason- ing, and natural language inference), I find that even shallow search configurations (beam width=2, depth=2) improve upon seed prompts on development sets. For instance, beam search achieves development accuracy gains from 0.40 to 0.80 on reasoning tasks, though test set improvements are more modest (0.20 to 0.50), indicating overfitting to the develop- ment heuristic. Analysis of successful optimisation paths reveals that transformations that make prompts concise appear most frequently, while verbosity operators are never selected. My results validate prompt optimization as a search problem and suggest that with greater computational resources and improved evaluation metrics, deeper exploration could yield more robust prompts that generalize beyond development sets. Code and implementation are available at [https://github.com/MaanasTaneja/PromptOptimiser].
☆ A Unified BERT-CNN-BiLSTM Framework for Simultaneous Headline Classification and Sentiment Analysis of Bangla News
In our daily lives, newspapers are an essential information source that impacts how the public talks about present-day issues. However, effectively navigating the vast amount of news content from different newspapers and online news portals can be challenging. Newspaper headlines with sentiment analysis tell us what the news is about (e.g., politics, sports) and how the news makes us feel (positive, negative, neutral). This helps us quickly understand the emotional tone of the news. This research presents a state-of-the-art approach to Bangla news headline classification combined with sentiment analysis applying Natural Language Processing (NLP) techniques, particularly the hybrid transfer learning model BERT-CNN-BiLSTM. We have explored a dataset called BAN-ABSA of 9014 news headlines, which is the first time that has been experimented with simultaneously in the headline and sentiment categorization in Bengali newspapers. Over this imbalanced dataset, we applied two experimental strategies: technique-1, where undersampling and oversampling are applied before splitting, and technique-2, where undersampling and oversampling are applied after splitting on the In technique-1 oversampling provided the strongest performance, both headline and sentiment, that is 78.57\% and 73.43\% respectively, while technique-2 delivered the highest result when trained directly on the original imbalanced dataset, both headline and sentiment, that is 81.37\% and 64.46\% respectively. The proposed model BERT-CNN-BiLSTM significantly outperforms all baseline models in classification tasks, and achieves new state-of-the-art results for Bangla news headline classification and sentiment analysis. These results demonstrate the importance of leveraging both the headline and sentiment datasets, and provide a strong baseline for Bangla text classification in low-resource.
☆ A Benchmark for Zero-Shot Belief Inference in Large Language Models
Beliefs are central to how humans reason, communicate, and form social connections, yet most computational approaches to studying them remain confined to narrow sociopolitical contexts and rely on fine-tuning for optimal performance. Despite the growing use of large language models (LLMs) across disciplines, how well these systems generalize across diverse belief domains remains unclear. We introduce a systematic, reproducible benchmark that evaluates the ability of LLMs to predict individuals' stances on a wide range of topics in a zero-shot setting using data from an online debate platform. The benchmark includes multiple informational conditions that isolate the contribution of demographic context and known prior beliefs to predictive success. Across several small- to medium-sized models, we find that providing more background information about an individual improves predictive accuracy, but performance varies substantially across belief domains. These findings reveal both the capacity and limitations of current LLMs to emulate human reasoning, advancing the study of machine behavior and offering a scalable framework for modeling belief systems beyond the sociopolitical sphere.
comment: 28 pages, 5 figures
☆ Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language and code generation, and are increasingly deployed as automatic judges of model outputs and learning activities. Yet, their behavior on structured tasks such as predicting the difficulty of competitive programming problems remains under-explored. We conduct a systematic comparison of GPT-4o, used purely as a natural-language difficulty assessor, against an interpretable Light-GBM ensemble trained on explicit numeric and textual features. On a dataset of 1,825 LeetCode problems labeled Easy, Medium, or Hard, LightGBM attains 86% accuracy, whereas GPT-4o reaches only 37.75%. Detailed analyses, including confusion matrices and SHAP-based interpretability, show that numeric constraints -- such as input size limits and acceptance rates -- play a crucial role in separating Hard problems from easier ones. By contrast, GPT-4o often overlooks these cues and exhibits a strong bias toward simpler categories. We further probe GPT-4o through a synthetic Hard-problem generation protocol. Surprisingly, GPT-4o labels almost all of its own synthetic Hard problems as Medium, contradicting its tendency to downgrade real Hard problems to Easy. Our findings connect to recent work on LLMs-as-judges and automatic difficulty estimation in programming and education, and highlight concrete failure modes that must be addressed before LLM-based judges can be considered trustworthy in competitive programming, educational platforms, or reinforcement-learning pipelines.
☆ Dealing with the Hard Facts of Low-Resource African NLP
Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.
comment: 10 pages, 4 figures
☆ From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like Github Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). While the field has evolved dramatically from rule-based systems to Transformer-based architectures, achieving performance improvements from single-digit to over 95\% success rates on benchmarks like HumanEval. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Last, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling law, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
☆ For Those Who May Find Themselves on the Red Team
This position paper argues that literary scholars must engage with large language model (LLM) interpretability research. While doing so will involve ideological struggle, if not out-right complicity, the necessity of this engagement is clear: the abiding instrumentality of current approaches to interpretability cannot be the only standard by which we measure interpretation with LLMs. One site at which this struggle could take place, I suggest, is the red team.
☆ MindEval: Benchmarking Language Models on Multi-turn Mental Health Support
Demand for mental health support through AI chatbots is surging, though current systems present several limitations, like sycophancy or overvalidation, and reinforcement of maladaptive beliefs. A core obstacle to the creation of better systems is the scarcity of benchmarks that capture the complexity of real therapeutic interactions. Most existing benchmarks either only test clinical knowledge through multiple-choice questions or assess single responses in isolation. To bridge this gap, we present MindEval, a framework designed in collaboration with Ph.D-level Licensed Clinical Psychologists for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. Through patient simulation and automatic evaluation with LLMs, our framework balances resistance to gaming with reproducibility via its fully automated, model-agnostic design. We begin by quantitatively validating the realism of our simulated patients against human-generated text and by demonstrating strong correlations between automatic and human expert judgments. Then, we evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6, on average, with particular weaknesses in problematic AI-specific patterns of communication. Notably, reasoning capabilities and model scale do not guarantee better performance, and systems deteriorate with longer interactions or when supporting patients with severe symptoms. We release all code, prompts, and human evaluation data.
☆ InstructAudio: Unified speech and music generation with natural language instruction
Text-to-speech (TTS) and text-to-music (TTM) models face significant limitations in instruction-based control. TTS systems usually depend on reference audio for timbre, offer only limited text-level attribute control, and rarely support dialogue generation. TTM systems are constrained by input conditioning requirements that depend on expert knowledge annotations. The high heterogeneity of these input control conditions makes them difficult to joint modeling with speech synthesis. Despite sharing common acoustic modeling characteristics, these two tasks have long been developed independently, leaving open the challenge of achieving unified modeling through natural language instructions. We introduce InstructAudio, a unified framework that enables instruction-based (natural language descriptions) control of acoustic attributes including timbre (gender, age), paralinguistic (emotion, style, accent), and musical (genre, instrument, rhythm, atmosphere). It supports expressive speech, music, and dialogue generation in English and Chinese. The model employs joint and single diffusion transformer layers with a standardized instruction-phoneme input format, trained on 50K hours of speech and 20K hours of music data, enabling multi-task learning and cross-modal alignment. Fig. 1 visualizes performance comparisons with mainstream TTS and TTM models, demonstrating that InstructAudio achieves optimal results on most metrics. To our best knowledge, InstructAudio represents the first instruction-controlled framework unifying speech and music generation. Audio samples are available at: https://qiangchunyu.github.io/InstructAudio/
☆ Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems AAAI 2026
The rapid advancement of Large Language Model (LLM)-driven multi-agent systems has significantly streamlined software developing tasks, enabling users with little technical expertise to develop executable applications. While these systems democratize software creation through natural language requirements, they introduce significant security risks that remain largely unexplored. We identify two risky scenarios: Malicious User with Benign Agents (MU-BA) and Benign User with Malicious Agents (BU-MA). We introduce the Implicit Malicious Behavior Injection Attack (IMBIA), demonstrating how multi-agent systems can be manipulated to generate software with concealed malicious capabilities beneath seemingly benign applications, and propose Adv-IMBIA as a defense mechanism. Evaluations across ChatDev, MetaGPT, and AgentVerse frameworks reveal varying vulnerability patterns, with IMBIA achieving attack success rates of 93%, 45%, and 71% in MU-BA scenarios, and 71%, 84%, and 45% in BU-MA scenarios. Our defense mechanism reduced attack success rates significantly, particularly in the MU-BA scenario. Further analysis reveals that compromised agents in the coding and testing phases pose significantly greater security risks, while also identifying critical agents that require protection against malicious user exploitation. Our findings highlight the urgent need for robust security measures in multi-agent software development systems and provide practical guidelines for implementing targeted, resource-efficient defensive strategies.
comment: Accepted by AAAI 2026 Alignment Track
☆ General Agentic Memory Via Deep Research
Memory is critical for AI agents, yet the widely-adopted static memory, aiming to create readily available memory in advance, is inevitably subject to severe information loss. To address this limitation, we propose a novel framework called \textbf{general agentic memory (GAM)}. GAM follows the principle of "\textbf{just-in time (JIT) compilation}" where it focuses on creating optimized contexts for its client at runtime while keeping only simple but useful memory during the offline stage. To this end, GAM employs a duo-design with the following components. 1) \textbf{Memorizer}, which highlights key historical information using a lightweight memory, while maintaining complete historical information within a universal page-store. 2) \textbf{Researcher}, which retrieves and integrates useful information from the page-store for its online request guided by the pre-constructed memory. This design allows GAM to effectively leverage the agentic capabilities and test-time scalability of frontier large language models (LLMs), while also facilitating end-to-end performance optimization through reinforcement learning. In our experimental study, we demonstrate that GAM achieves substantial improvement on various memory-grounded task completion scenarios against existing memory systems.
☆ Multi-Agent Collaborative Filtering: Orchestrating Users and Items for Agentic Recommendations
Agentic recommendations cast recommenders as large language model (LLM) agents that can plan, reason, use tools, and interact with users of varying preferences in web applications. However, most existing agentic recommender systems focus on generic single-agent plan-execute workflows or multi-agent task decomposition pipelines. Without recommendation-oriented design, they often underuse the collaborative signals in the user-item interaction history, leading to unsatisfying recommendation results. To address this, we propose the Multi-Agent Collaborative Filtering (MACF) framework for agentic recommendations, drawing an analogy between traditional collaborative filtering algorithms and LLM-based multi-agent collaboration. Specifically, given a target user and query, we instantiate similar users and relevant items as LLM agents with unique profiles. Each agent is able to call retrieval tools, suggest candidate items, and interact with other agents. Different from the static preference aggregation in traditional collaborative filtering, MACF employs a central orchestrator agent to adaptively manage the collaboration between user and item agents via dynamic agent recruitment and personalized collaboration instruction. Experimental results on datasets from three different domains show the advantages of our MACF framework compared to strong agentic recommendation baselines.
☆ SmolKalam: Ensemble Quality-Filtered Translation at Scale for High Quality Arabic Post-Training Data
Although the community has tackled the acquisition of high-quality Arabic pretraining data, we still lack large-scale, multi-turn Arabic datasets that include reasoning and tool calling. Naive translation can work at the pretraining scale, but post-training demands much higher quality, which requires a stricter approach to dataset curation. In this work, we introduce SmolKalam, a translation of Smoltalk2 that uses a multi-model ensemble translation pipeline, applies quality filtering, and examines effective translation techniques for traditional decoder-only models through ablations.
comment: Work in progress
☆ Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models
Mechanistic interpretability (MI) seeks to uncover how language models (LMs) implement specific behaviors, yet measuring progress in MI remains challenging. The recently released Mechanistic Interpretability Benchmark (MIB; Mueller et al., 2025) provides a standardized framework for evaluating circuit and causal variable localization. Building on this foundation, the BlackboxNLP 2025 Shared Task extends MIB into a community-wide reproducible comparison of MI techniques. The shared task features two tracks: circuit localization, which assesses methods that identify causally influential components and interactions driving model behavior, and causal variable localization, which evaluates approaches that map activations into interpretable features. With three teams spanning eight different methods, participants achieved notable gains in circuit localization using ensemble and regularization strategies for circuit discovery. With one team spanning two methods, participants achieved significant gains in causal variable localization using low-dimensional and non-linear projections to featurize activation vectors. The MIB leaderboard remains open; we encourage continued work in this standard evaluation framework to measure progress in MI research going forward.
☆ Towards Robust and Fair Next Visit Diagnosis Prediction under Noisy Clinical Notes with Large Language Models AAAI
A decade of rapid advances in artificial intelligence (AI) has opened new opportunities for clinical decision support systems (CDSS), with large language models (LLMs) demonstrating strong reasoning abilities on timely medical tasks. However, clinical texts are often degraded by human errors or failures in automated pipelines, raising concerns about the reliability and fairness of AI-assisted decision-making. Yet the impact of such degradations remains under-investigated, particularly regarding how noise-induced shifts can heighten predictive uncertainty and unevenly affect demographic subgroups. We present a systematic study of state-of-the-art LLMs under diverse text corruption scenarios, focusing on robustness and equity in next-visit diagnosis prediction. To address the challenge posed by the large diagnostic label space, we introduce a clinically grounded label-reduction scheme and a hierarchical chain-of-thought (CoT) strategy that emulates clinicians' reasoning. Our approach improves robustness and reduces subgroup instability under degraded inputs, advancing the reliable use of LLMs in CDSS. We release code at https://github.com/heejkoo9/NECHOv3.
comment: Accepted by the Association for the Advancement of Artificial Intelligence (AAAI) 2026 1st Workshop on Safe, Ethical, Certified, Uncertainty-aware, Robust, and Explainable AI for Health (SECURE-AI4H)
☆ Tu crois que c'est vrai ? Diversite des regimes d'enonciation face aux fake news et mecanismes d'autoregulation conversationnelle
This thesis addresses two paradoxes: (1) why empirical studies find that fake news represent only a small share of the information consulted and shared on social media despite the absence of editorial control or journalistic norms, and (2) how political polarization has intensified even though users do not appear especially receptive to fake news. To investigate these issues, two complementary studies were carried out on Twitter and Facebook, combining quantitative analyses of digital traces with online observation and interviews. This mixed-methods design avoids reducing users to single reactions to identified fake items and instead examines the variety of practices across different interactional situations, online and offline, while recording socio-demographic traits. The first study mapped users who shared at least one item labeled fake by fact-checkers in the French Twittersphere. The second used a corpus of items flagged by Facebook users to study reactions to statements whose epistemic status is uncertain. Three main findings emerge. First, sharing fake news is concentrated among a limited group of users who are not less educated or cognitively disadvantaged but are more politicized and critical of institutions; owing to their high activity and prolific sharing, they can help set the agenda for their political camp. Second, exposed users can deploy varying forms of critical distance depending on their social position and the interactional norms of the situations they inhabit: either discursive caution (prudence énonciative) or interventions ('points d'arrêt') that express disagreement or corrections. Third, these forms of critical distance seldom yield genuine deliberative debates or agonistic pluralism; rather, they often produce dialogues of the deaf among a small, particularly active minority.
comment: in French language
☆ OmniStruct: Universal Text-to-Structure Generation across Diverse Schemas
The ability of Large Language Models (LLMs) to generate structured outputs that follow arbitrary schemas is crucial to a wide range of downstream tasks that require diverse structured representations of results such as information extraction, table generation, and function calling. While modern LLMs excel in generating unstructured responses in natural language, whether this advancement translates to a strong performance on text-to-structure tasks remains unclear. To bridge this gap, we first introduce OmniStruct, a comprehensive benchmark for assessing LLMs' capabilities on diverse text-to-structure tasks such as information extraction, table generation, and function calling. We build OmniStruct by identifying existing datasets across a wide range of tasks that are suitable for a structured answer format, and adapting them under a unified text-to-structure problem setting. To facilitate the development of efficient text-to-structure models, we collect high-quality training data via synthetic task generation. Without using any supervised data for OmniStruct tasks, our experiments demonstrate the possibility of fine-tuning much smaller models on synthetic data into universal structured generation models that can rival the performance of GPT-4o.
☆ Gradient Masters at BLP-2025 Task 1: Advancing Low-Resource NLP for Bengali using Ensemble-Based Adversarial Training for Hate Speech Detection AACL
This paper introduces the approach of "Gradient Masters" for BLP-2025 Task 1: "Bangla Multitask Hate Speech Identification Shared Task". We present an ensemble-based fine-tuning strategy for addressing subtasks 1A (hate-type classification) and 1B (target group classification) in YouTube comments. We propose a hybrid approach on a Bangla Language Model, which outperformed the baseline models and secured the 6th position in subtask 1A with a micro F1 score of 73.23% and the third position in subtask 1B with 73.28%. We conducted extensive experiments that evaluated the robustness of the model throughout the development and evaluation phases, including comparisons with other Language Model variants, to measure generalization in low-resource Bangla hate speech scenarios and data set coverage. In addition, we provide a detailed analysis of our findings, exploring misclassification patterns in the detection of hate speech.
comment: 6 pages, 2 figures, 4 tables. Accepted at the Second International Workshop on Bangla Language Processing (BLP-2025) co-located with AACL-IJCNLP 2025. Ranked 6th (Subtask 1A, 73.23% micro F1) and 3rd (Subtask 1B, 73.28% micro F1) on the official leaderboard
☆ Path-Constrained Retrieval: A Structural Approach to Reliable LLM Agent Reasoning Through Graph-Scoped Semantic Search
Large Language Model agents often retrieve context from knowledge bases that lack structural consistency with the agent's current reasoning state, leading to incoherent reasoning chains. We introduce Path-Constrained Retrieval (PCR), a retrieval method that combines structural graph constraints with semantic search to ensure retrieved information maintains logical relationships within a knowledge graph. PCR restricts the search space to nodes reachable from an anchor node, preventing retrieval of structurally disconnected information that may lead to inconsistent reasoning. We evaluate PCR on PathRAG-6, a benchmark spanning six domains with 180 nodes and 360 edges. Our results show that PCR achieves full structural consistency compared to 24-32 percent in baseline methods, while maintaining strong relevance scores. On the technology domain, PCR obtains full relevance at rank 10 with full structural consistency, significantly outperforming vector search and hybrid retrieval. PCR reduces the average graph distance of retrieved context by 78 percent compared to baselines, demonstrating retrieval of more structurally consistent information. These findings suggest that path-constrained retrieval is an effective approach for improving the reliability and coherence of LLM agent reasoning systems.
comment: 10 pages
☆ Table Comprehension in Building Codes using Vision Language Models and Domain-Specific Fine-Tuning
Building codes contain critical information for ensuring safety, regulatory compliance, and informed decision-making in construction and engineering. Automated question answering systems over such codes enable quick and accurate access to specific regulatory clauses, improving efficiency and reducing errors. Retrieval-Augmented Generation (RAG) systems are essential for this task as they combine the precision of information retrieval with the generative capabilities of language models. However, tabular data are challenging to extract as they often involve complex layouts, merged cells, multi-row headers, and embedded semantic relationships that are not easily captured by traditional natural language processing techniques and Vision Language Models (VLMs). This paper explores and compares two methods for extracting information from tabular data in building codes using several pre-trained VLMs. First, a direct input method is used, where the image of the page is input directly into the VLMs, which are then tasked with answering questions based on the image. Second, an indirect input method is introduced, which involves converting an image of a page containing tables into the LaTeX code and then answering inquires based on the LaTeX-based input. The experiments find that the direct input method generally resulted in higher accuracy than the indirect input method. To further improve the performance, we fine-tuned each VLM using Low Rank Adaptation (LoRA) on a domain-specific tabular dataset. The fine-tuned models exhibited substantial improvements, with Qwen2.5-VL-3B-Instruct achieving relative accuracy gains exceeding 100%. Our results highlight the potential of parameter-efficient fine-tuning methods to adapt powerful VLMs for understanding complex structured data in specialized fields, such as building code interpretation and regulatory compliance.
☆ "AGI" team at SHROOM-CAP: Data-Centric Approach to Multilingual Hallucination Detection using XLM-RoBERTa AACL
The detection of hallucinations in multilingual scientific text generated by Large Language Models (LLMs) presents significant challenges for reliable AI systems. This paper describes our submission to the SHROOM-CAP 2025 shared task on scientific hallucination detection across 9 languages. Unlike most approaches that focus primarily on model architecture, we adopted a data-centric strategy that addressed the critical issue of training data scarcity and imbalance. We unify and balance five existing datasets to create a comprehensive training corpus of 124,821 samples (50% correct, 50% hallucinated), representing a 172x increase over the original SHROOM training data. Our approach fine-tuned XLM-RoBERTa-Large with 560 million parameters on this enhanced dataset, achieves competitive performance across all languages, including \textbf{2nd place in Gujarati} (zero-shot language) with Factuality F1 of 0.5107, and rankings between 4th-6th place across the remaining 8 languages. Our results demonstrate that systematic data curation can significantly outperform architectural innovations alone, particularly for low-resource languages in zero-shot settings.
comment: Accepted to the 1st Workshop on Confabulation, Hallucinations & Overgeneration in Multilingual and Practical Settings (CHOMPS) at AACL-IJCNLP 2025
☆ From Archives to Decisions: Multi-Agent Pharmaceutical Co-Scientist for Traceable Drug Discovery and Reverse Translation
Pharmaceutical research and development has accumulated vast, heterogeneous archives of data. Much of this knowledge stems from discontinued programs, and reusing these archives is invaluable for reverse translation. However, in practice, such reuse is often infeasible. In this work, we introduce DiscoVerse, a multi-agent co-scientist designed to support pharmaceutical research and development. The system implements semantic retrieval, cross-document linking, and auditable synthesis on a large historical corpus from Roche. To validate our approach at real-world scale, we selected a subset of 180 molecules from the Roche research repositories, covering over 0.87 billion BPE tokens and more than four decades of research. Given that automated evaluation metrics are poorly aligned with scientific utility, we evaluate the performance of DiscoVerse using blinded expert evaluation of source-linked outputs. To our knowledge, this is the first agentic framework systematically assessed on real pharmaceutical data for reverse translation, enabled by authorized access to confidential, end-to-end drug-development archives. Our contributions include role-specialized agent designs aligned with scientist workflows; human-in-the-loop support for reverse translation; expert evaluation; and a large-scale demonstration showing promising answer accuracy and decision-making insights. In brief, across seven benchmark queries covering 180 molecules, DiscoVerse achieved near-perfect recall ($\geq 0.99$) with moderate precision ($0.71-0.91$), while qualitative assessments of discontinuation rationale and organ-specific toxicity showed faithful, source-linked synthesis across preclinical and clinical evidence.
comment: 22 pages, 4 figures, 3 tables
♻ ☆ LLMs4All: A Review of Large Language Models Across Academic Disciplines
Cutting-edge Artificial Intelligence (AI) techniques keep reshaping our view of the world. For example, Large Language Models (LLMs) based applications such as ChatGPT have shown the capability of generating human-like conversation on extensive topics. Due to the impressive performance on a variety of language-related tasks (e.g., open-domain question answering, translation, and document summarization), one can envision the far-reaching impacts that can be brought by the LLMs with broader real-world applications (e.g., customer service, education and accessibility, and scientific discovery). Inspired by their success, this paper will offer an overview of state-of-the-art LLMs and their integration into a wide range of academic disciplines, including: (1) arts, letters, and law (e.g., history, philosophy, political science, arts and architecture, law), (2) economics and business (e.g., finance, economics, accounting, marketing), and (3) science and engineering (e.g., mathematics, physics and mechanical engineering, chemistry and chemical engineering, life sciences and bioengineering, earth sciences and civil engineering, computer science and electrical engineering). Integrating humanity and technology, in this paper, we will explore how LLMs are shaping research and practice in these fields, while also discussing key limitations, open challenges, and future directions in the era of generative AI. The review of how LLMs are engaged across disciplines-along with key observations and insights-can help researchers and practitioners interested in exploiting LLMs to advance their works in diverse real-world applications.
♻ ☆ Non-Linear Scoring Model for Translation Quality Evaluation
Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition. Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size. Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model E(x) = a * ln(1 + b * x), a, b > 0, anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within +/-20 percent relative error and integrates into existing evaluation workflows with only a dynamic tolerance function added. The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.
comment: ongoing work, 38 pages
♻ ☆ Time-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks
Large Language Models (LLMs) have revolutionized conversational AI, yet their robustness in extended multi-turn dialogues remains poorly understood. Existing evaluation frameworks focus on static benchmarks and single-turn assessments, failing to capture the temporal dynamics of conversational degradation that characterize real-world interactions. In this work, we present a large-scale survival analysis of conversational robustness, modeling failure as a time-to-event process over 36,951 turns from 9 state-of-the-art LLMs on the MT-Consistency benchmark. Our framework combines Cox proportional hazards, Accelerated Failure Time (AFT), and Random Survival Forest models with simple semantic drift features. We find that abrupt prompt-to-prompt semantic drift sharply increases the hazard of inconsistency, whereas cumulative drift is counterintuitively \emph{protective}, suggesting adaptation in conversations that survive multiple shifts. AFT models with model-drift interactions achieve the best combination of discrimination and calibration, and proportional hazards checks reveal systematic violations for key drift covariates, explaining the limitations of Cox-style modeling in this setting. Finally, we show that a lightweight AFT model can be turned into a turn-level risk monitor that flags most failing conversations several turns before the first inconsistent answer while keeping false alerts modest. These results establish survival analysis as a powerful paradigm for evaluating multi-turn robustness and for designing practical safeguards for conversational AI systems.
♻ ☆ A Novel Framework for Augmenting Rating Scale Tests with LLM-Scored Text Data
Psychological assessments are dominated by rating scales, which cannot capture the nuance in natural language. Efforts to supplement them with qualitative text have relied on labelled datasets or expert rubrics, limiting scalability. We introduce a framework that avoids this reliance: large language models (LLMs) score free-text responses with simple prompts to produce candidate LLM items, from which we retain those that yield the most test information when co-calibrated with a baseline scale. Using depression as a case study, we developed and tested the method in upper-secondary students (n=693) and a matched synthetic dataset (n=3,000). Results on held-out test sets showed that augmenting a 19-item scale with LLM items improved its precision, accuracy, and convergent validity. Further, the test information gain matched that of adding as many as 16 rating-scale items. This framework leverages the increasing availability of transcribed language to enhance psychometric measures, with applications in clinical health and beyond.
♻ ☆ Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs ICLR 2025
LLMs are commonly trained with a learning rate (LR) warmup, followed by cosine decay to 10% of the maximum (10x decay). In a large-scale empirical study, we show that under an optimal peak LR, a simple linear decay-to-zero (D2Z) schedule consistently outperforms other schedules when training at compute-optimal dataset sizes. D2Z is superior across a range of model sizes, batch sizes, datasets, and vocabularies. Benefits increase as dataset size increases. Leveraging a novel interpretation of AdamW as an exponential moving average of weight updates, we show how linear D2Z optimally balances the demands of early training (moving away from initial conditions) and late training (averaging over more updates in order to mitigate gradient noise). In experiments, a 610M-parameter model trained for 80 tokens-per-parameter (TPP) using D2Z achieves lower loss than when trained for 200 TPP using 10x decay, corresponding to an astonishing 60% compute savings. Models such as Llama2-7B, trained for 286 TPP with 10x decay, could likely have saved a majority of compute by training with D2Z.
comment: ICLR 2025
♻ ☆ Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training NeurIPS 2025
Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate $η$ and weight decay $λ$. We study scaling laws for HPs: formulas for how to scale HPs as we scale model size N, dataset size D, and batch size B. Recent work suggests the AdamW timescale, $τ= B/(ηλD)$, should remain constant across training settings, and we verify the implication that optimal $λ$ scales linearly with B, for a fixed N and D. However, as N and D scale, we show optimal $τ$ obeys a precise power law in the tokens-per-parameter ratio, D/N. This law thus provides a method to accurately predict $λ$opt in advance of large-scale training. We also study scaling laws for optimal batch size Bopt (the B enabling lowest loss at a given N,D) and critical batch size Bcrit (the B beyond which further data parallelism becomes ineffective). In contrast to prior work, we find both Bopt and Bcrit scale as power laws in D, independent of model size, N. Finally, we analyze how these findings inform the real-world selection of Pareto-optimal N and D under dual training time and compute objectives. All experiments were run on Cerebras CS-3 systems.
comment: NeurIPS 2025
♻ ☆ Lessons from Studying Two-Hop Latent Reasoning
Large language models can use chain-of-thought (CoT) to externalize reasoning, potentially enabling oversight of capable LLM agents. Prior work has shown that models struggle at two-hop question-answering without CoT. This capability is so basic that if it was a fundamental limitation, it would imply that many complex agentic tasks would similarly require CoT. We investigate LLM latent reasoning capabilities using two-hop question answering as a case study. Previous work on the gap between latent and externalized two-hop reasoning produced mixed evidence with inconclusive results. In this paper, we introduce a controlled setting for investigating two-hop reasoning in LLMs, where a positive result provides definitive evidence for latent reasoning. We fine-tune LLMs (including Llama 3 8B and GPT-4o) on synthetic facts and test two-hop reasoning over these facts. By using synthetic facts, we rule out memorization and reasoning shortcuts as explanations for two-hop performance. We observe a nuanced picture: Models fail to compose two synthetic facts, but can succeed when one fact is synthetic and the other is natural. These results demonstrate that LLMs are undeniably capable of latent two-hop reasoning, although it remains unclear how this ability scales with model size. Finally, we highlight a lesson for researchers studying LLM reasoning: when drawing conclusions about LLM latent reasoning, one must be careful to avoid both spurious successes (that stem from memorization and reasoning shortcuts) and spurious failures (that may stem from artificial experimental setups, divorced from training setups of frontier LLMs).
♻ ☆ ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?
Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks which require agents to replicate the paper's core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the original paper authors and targets a key scientific result, enabling objective evaluation of both faithfulness (adherence to original methods) and correctness (technical accuracy of results). ReplicationBench is extremely challenging for current frontier language models: even the best-performing language models score under 20%. We analyze ReplicationBench trajectories in collaboration with domain experts and find a rich, diverse set of failure modes for agents in scientific research. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks, reveals insights about agent performance generalizable to other domains of data-driven science, and provides a scalable framework for measuring AI agents' reliability in scientific research.
♻ ☆ Assessing Historical Structural Oppression Worldwide via Rule-Guided Prompting of Large Language Models
Traditional efforts to measure historical structural oppression struggle with cross-national validity due to the unique, locally specified histories of exclusion, colonization, and social status in each country, and often have relied on structured indices that privilege material resources while overlooking lived, identity-based exclusion. We introduce a novel framework for oppression measurement that leverages Large Language Models (LLMs) to generate context-sensitive scores of lived historical disadvantage across diverse geopolitical settings. Using unstructured self-identified ethnicity utterances from a multilingual COVID-19 global study, we design rule-guided prompting strategies that encourage models to produce interpretable, theoretically grounded estimations of oppression. We systematically evaluate these strategies across multiple state-of-the-art LLMs. Our results demonstrate that LLMs, when guided by explicit rules, can capture nuanced forms of identity-based historical oppression within nations. This approach provides a complementary measurement tool that highlights dimensions of systemic exclusion, offering a scalable, cross-cultural lens for understanding how oppression manifests in data-driven research and public health contexts. To support reproducible evaluation, we release an open-sourced benchmark dataset for assessing LLMs on oppression measurement (https://github.com/chattergpt/HSO-Bench).
comment: To appear in the 2025 IEEE International Conference on Big Data (IEEE BigData 2025)
♻ ☆ PsychiatryBench: A Multi-Task Benchmark for LLMs in Psychiatry
Large language models (LLMs) offer significant potential in enhancing psychiatric practice, from improving diagnostic accuracy to streamlining clinical documentation and therapeutic support. However, existing evaluation resources heavily rely on small clinical interview corpora, social media posts, or synthetic dialogues, which limits their clinical validity and fails to capture the full complexity of diagnostic reasoning. In this work, we introduce PsychiatryBench, a rigorously curated benchmark grounded exclusively in authoritative, expert-validated psychiatric textbooks and casebooks. PsychiatryBench comprises eleven distinct question-answering tasks ranging from diagnostic reasoning and treatment planning to longitudinal follow-up, management planning, clinical approach, sequential case analysis, and multiple-choice/extended matching formats totaling 5,188 expert-annotated items. {\color{red}We evaluate a diverse set of frontier LLMs (including Google Gemini, DeepSeek, Sonnet 4.5, and GPT 5) alongside leading open-source medical models such as MedGemma using both conventional metrics and an "LLM-as-judge" similarity scoring framework. Our results reveal substantial gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks, underscoring the need for specialized model tuning and more robust evaluation paradigms. PsychiatryBench offers a modular, extensible platform for benchmarking and improving LLM performance in mental health applications.
♻ ☆ VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
Recent researches on video large language models (VideoLLM) predominantly focus on model architectures and training datasets, leaving the interaction format between the user and the model under-explored. In existing works, users often interact with VideoLLMs by using the entire video and a query as input, after which the model generates a response. This interaction format constrains the application of VideoLLMs in scenarios such as live-streaming comprehension where videos do not end and responses are required in a real-time manner, and also results in unsatisfactory performance on time-sensitive tasks that requires localizing video segments. In this paper, we focus on a video-text duet interaction format. This interaction format is characterized by the continuous playback of the video, and both the user and the model can insert their text messages at any position during the video playback. When a text message ends, the video continues to play, akin to the alternative of two performers in a duet. We construct MMDuetIT, a video-text training dataset designed to adapt VideoLLMs to video-text duet interaction format. We also introduce the Multi-Answer Grounded Video Question Answering (MAGQA) task to benchmark the real-time response ability of VideoLLMs. Trained on MMDuetIT, MMDuet demonstrates that adopting the video-text duet interaction format enables the model to achieve significant improvements in various time-sensitive tasks (76% CIDEr on YouCook2 dense video captioning, 90\% mAP on QVHighlights highlight detection and 25% R@0.5 on Charades-STA temporal video grounding) with minimal training efforts, and also enable VideoLLMs to reply in a real-time manner as the video plays.
comment: 9 pages
♻ ☆ One SPACE to Rule Them All: Jointly Mitigating Factuality and Faithfulness Hallucinations in LLMs NIPS 2025
LLMs have demonstrated unprecedented capabilities in natural language processing, yet their practical deployment remains hindered by persistent factuality and faithfulness hallucinations. While existing methods address these hallucination types independently, they inadvertently induce performance trade-offs, as interventions targeting one type often exacerbate the other. Through empirical and theoretical analysis of activation space dynamics in LLMs, we reveal that these hallucination categories share overlapping subspaces within neural representations, presenting an opportunity for concurrent mitigation. To harness this insight, we propose SPACE, a unified framework that jointly enhances factuality and faithfulness by editing shared activation subspaces. SPACE establishes a geometric foundation for shared subspace existence through dual-task feature modeling, then identifies and edits these subspaces via a hybrid probe strategy combining spectral clustering and attention head saliency scoring. Experimental results across multiple benchmark datasets demonstrate the superiority of our approach.
comment: Accepted as NIPS 2025 poster
♻ ☆ Llama2Vec: Unsupervised Adaptation of Large Language Models for Dense Retrieval ACL 2024
Dense retrieval calls for discriminative embeddings to represent the semantic relationship between query and document. It may benefit from the using of large language models (LLMs), given LLMs' strong capability on semantic understanding. However, the LLMs are learned by auto-regression, whose working mechanism is completely different from representing whole text as one discriminative embedding. Thus, it is imperative to study how to adapt LLMs properly so that they can be effectively initialized as the backbone encoder for dense retrieval. In this paper, we propose a novel approach, called Llama2Vec, which performs unsupervised adaptation of LLM for its dense retrieval application. Llama2Vec consists of two pretext tasks: EBAE (Embedding-Based Auto-Encoding) and EBAR (Embedding-Based Auto-Regression), where the LLM is prompted to reconstruct the input sentence and predict the next sentence based on its text embeddings. Llama2Vec is simple, lightweight, but highly effective. It is used to adapt LLaMA-2-7B on the Wikipedia corpus. With a moderate steps of adaptation, it substantially improves the model's fine-tuned performances on a variety of dense retrieval benchmarks. Notably, it results in the new state-of-the-art performances on popular benchmarks, such as passage and document retrieval on MSMARCO, and zero-shot retrieval on BEIR. The model and source code will be made publicly available to facilitate the future research. Our model is available at https://github.com/FlagOpen/FlagEmbedding.
comment: ACL 2024
♻ ☆ Conversations: Love Them, Hate Them, Steer Them
Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output or require extensive fine-tuning. This paper demonstrates that targeted activation engineering can steer LLaMA 3.1-8B to exhibit more human-like emotional nuances. We first employ attribution patching to identify causally influential components, to find a key intervention locus by observing activation patterns during diagnostic conversational tasks. We then derive emotional expression vectors from the difference in the activations generated by contrastive text pairs (positive vs. negative examples of target emotions). Applying these vectors to new conversational prompts significantly enhances emotional characteristics: steered responses show increased positive sentiment (e.g., joy, trust) and more frequent first-person pronoun usage, indicative of greater personal engagement. Our findings offer a precise and interpretable method for controlling specific emotional attributes in LLMs, contributing to developing more aligned and empathetic conversational AI.
comment: We have created a new arXiv submission with a more up to date version of this paper at arXiv:2511.12832
♻ ☆ ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection
Hateful memes have emerged as a particularly challenging form of online abuse, motivating the development of automated detection systems. Most prior approaches rely on direct detection, producing only binary predictions. Such models fail to provide the context and explanations that real-world moderation requires. Recent Explain-then-Detect approaches, using Chain-of-Thought prompting or LMM agents, perform worse than simple SFT baselines, and even advanced post-training methods such as GRPO fail to close the gap. Our analysis identifies two key issues of such systems: important policy-relevant cues such as targets and attack types are not hypothesized by the model as a likely explanation; and the binary reward signal is insufficient to guide reasoning. To address these challenges, we propose ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes), inspired by the training and evaluation process of human annotators. ExPO-HM combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both metric and reward for reasoning quality. Across three hateful meme benchmarks, ExPO-HM achieves state-of-the-art performance on binary detection, fine-grained classification, and reasoning quality, with up to 15\% and 17\% F1 improvement over the GRPO and DPO baselines, respectively. By moving hateful meme detection from simple binary alarms to explanation-driven detection, ExPO-HM provides accurate, interpretable, and actionable moderation support.
comment: Preprint
♻ ☆ ReCode: Updating Code API Knowledge with Reinforcement Learning AAAI 2026
Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs' code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs' general code generation abilities. We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms that of the 32B parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
comment: AAAI 2026
♻ ☆ LoKI: Low-damage Knowledge Implanting of Large Language Models AAAI-26
Fine-tuning adapts pretrained models for specific tasks but poses the risk of catastrophic forgetting (CF), where critical knowledge from pretraining is overwritten. To address the issue of CF in a general-purpose framework, we propose Low-damage Knowledge Implanting (LoKI), a parameter-efficient fine-tuning (PEFT) technique that utilizes recent mechanistic understanding of how knowledge is stored in transformer architectures. We compare LoKI against state-of-the-art PEFT methods in two real-world fine-tuning scenarios. The results show that LoKI demonstrates significantly better preservation of general capabilities. At the same time, its task-specific performance is comparable to or even surpasses that of full parameter fine-tuning and these PEFT methods across various model architectures. Our work bridges the mechanistic insights of LLMs' knowledge storage with practical fine-tuning objectives, enabling an effective balance between task-specific adaptation and the retention of general-purpose capabilities.
comment: AAAI-26 Oral
♻ ☆ Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning EMNLP
The effectiveness of instruction fine-tuning for Large Language Models is fundamentally constrained by the quality and efficiency of training datasets. This work introduces Low-Confidence Gold (LCG), a novel filtering framework that employs centroid-based clustering and confidence-guided selection for identifying valuable instruction pairs. Through a semi-supervised approach using a lightweight classifier trained on representative samples, LCG curates high-quality subsets while preserving data diversity. Experimental evaluation demonstrates that models fine-tuned on LCG-filtered subsets of 6K samples achieve superior performance compared to existing methods, with substantial improvements on MT-bench and consistent gains across comprehensive evaluation metrics. The framework's efficacy while maintaining model performance establishes a promising direction for efficient instruction tuning.
comment: Accepted to EMNLP Findings 2025
♻ ☆ Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generating. Based on the dense LLM, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodallity understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.
comment: 47 pages,10 Figures, Project Website: https://idealistxy.github.io/Uni-MoE-v2.github.io/ Codes: https://github.com/HITsz-TMG/Uni-MoE
♻ ☆ UPLME: Uncertainty-Aware Probabilistic Language Modelling for Robust Empathy Regression
Noisy self-reported empathy scores challenge supervised learning for empathy regression. While many algorithms have been proposed for learning with noisy labels in textual classification problems, the regression counterpart is relatively under-explored. We propose UPLME, an uncertainty-aware probabilistic language modelling framework to capture label noise in empathy regression tasks. One of the novelties in UPLME is a probabilistic language model that predicts both empathy scores and heteroscedastic uncertainty, and is trained using Bayesian concepts with variational model ensembling. We further introduce two novel loss components: one penalises degenerate Uncertainty Quantification (UQ), and another enforces similarity between the input pairs on which empathy is being predicted. UPLME achieves state-of-the-art performance (Pearson Correlation Coefficient: $0.558\rightarrow0.580$ and $0.629\rightarrow0.634$) in terms of the performance reported in the literature on two public benchmarks with label noise. Through synthetic label noise injection, we demonstrate that UPLME is effective in distinguishing between noisy and clean samples based on the predicted uncertainty. UPLME further outperform (Calibration error: $0.571\rightarrow0.376$) a recent variational model ensembling-based UQ method designed for regression problems. Code is publicly available at https://github.com/hasan-rakibul/UPLME.
comment: Code available at https://github.com/hasan-rakibul/UPLME
♻ ☆ FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models NeurIPS 2025
Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) the redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework, mitigating the insufficiency of the current criterion for identifying redundant tokens and better aligning with the model's inherent behaviors. Extensive experiments show that FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2x speed-up in the prefilling stage. Our code is available at https://github.com/TungChintao/FlowCut
comment: Accepted by NeurIPS 2025
♻ ☆ OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models
Since Multimodal Large Language Models (MLLMs) are increasingly being integrated into everyday tools and intelligent agents, growing concerns have arisen regarding their possible output of unsafe contents, ranging from toxic language and biased imagery to privacy violations and harmful misinformation. Current safety benchmarks remain highly limited in both modality coverage and performance evaluations, often neglecting the extensive landscape of content safety. In this work, we introduce OutSafe-Bench, the first most comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories. To ensure fair and robust evaluation, we propose FairScore, an explainable automated multi-reviewer weighted aggregation framework. FairScore selects top-performing models as adaptive juries, thereby mitigating biases from single-model judgments and enhancing overall evaluation reliability. Our evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities, underscoring the pressing need for robust safeguards in MLLMs.
♻ ☆ LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology
Large language models (LLMs) and emerging agentic frameworks are beginning to transform single-cell biology by enabling natural-language reasoning, generative annotation, and multimodal data integration. However, progress remains fragmented across data modalities, architectures, and evaluation standards. LLM4Cell presents the first unified survey of 58 foundation and agentic models developed for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. We categorize these methods into five families-foundation, text-bridge, spatial, multimodal, epigenomic, and agentic-and map them to eight key analytical tasks including annotation, trajectory and perturbation modeling, and drug-response prediction. Drawing on over 40 public datasets, we analyze benchmark suitability, data diversity, and ethical or scalability constraints, and evaluate models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability. By linking datasets, models, and evaluation domains, LLM4Cell provides the first integrated view of language-driven single-cell intelligence and outlines open challenges in interpretability, standardization, and trustworthy model development.
comment: 34 pages, 5 figures, 7 tables
♻ ☆ Spatial Knowledge Graph-Guided Multimodal Synthesis
Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. Our approach addresses this critical gap by providing a systematic framework for generating spatially coherent data. In this work, we introduce SKG2DATA, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2DATA employs an automated pipeline for constructing Spatial Knowledge Graph (SKG) that effectively captures human-like spatial cognition, including directional and distance relationships. These structured representations then serve as precise guidance for our integrated synthesis pipeline, where a diffusion model generates spatially-consistent images while a MLLM produces corresponding textual descriptions. The automated construction of SKG enables scalable generation of diverse yet realistic spatial configurations, overcoming the limitations of manual data collection and annotation. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly, albeit with a slight cost to their general capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence. Code is available at https://github.com/zjunlp/Knowledge2Data.
comment: IEEE/ACM Transactions on Audio, Speech and Language Processing
♻ ☆ BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation
The rapid progress of graph generation has raised new security concerns, particularly regarding backdoor vulnerabilities. While prior work has explored backdoor attacks in image diffusion and unconditional graph generation, conditional, especially text-guided graph generation remains largely unexamined. This paper proposes BadGraph, a backdoor attack method against latent diffusion models for text-guided graph generation. BadGraph leverages textual triggers to poison training data, covertly implanting backdoors that induce attacker-specified subgraphs during inference when triggers appear, while preserving normal performance on clean inputs. Extensive experiments on four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the effectiveness and stealth of the attack: less than 10% poisoning rate can achieves 50% attack success rate, while 24% suffices for over 80% success rate, with negligible performance degradation on benign samples. Ablation studies further reveal that the backdoor is implanted during VAE and diffusion training rather than pretraining. These findings reveal the security vulnerabilities in latent diffusion models of text-guided graph generation, highlight the serious risks in models' applications such as drug discovery and underscore the need for robust defenses against the backdoor attack in such diffusion models.
♻ ☆ Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection
Attention mechanisms have revolutionized several domains of artificial intelligence, such as natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. While recent work has characterized the optimization dynamics of gradient descent (GD) in attention-based models and the structural properties of its preferred solutions, less is known about more general optimization algorithms such as mirror descent (MD). In this paper, we investigate the convergence properties and implicit biases of a family of MD algorithms tailored for softmax attention mechanisms, with the potential function chosen as the $p$-th power of the $\ell_p$-norm. Specifically, we show that these algorithms converge in direction to a generalized hard-margin SVM with an $\ell_p$-norm objective when applied to a classification problem using a softmax attention model. Notably, our theoretical results reveal that the convergence rate is comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the present problem. Additionally, we delve into the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which this complex joint optimization converges to their respective hard-margin SVM solutions. Lastly, our numerical experiments on real data demonstrate that MD algorithms improve generalization over standard GD and excel in optimal token selection.
♻ ☆ Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems
Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing, storing only text representations in vector databases, which causes loss of contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems, including text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate all three approaches across 6 LLM models and a two multi-modal embedding models on a newly created financial earnings call benchmark comprising 40 question-answer pairs, each paired with 2 documents (1 image and 1 text chunk). Experimental results demonstrate that direct multimodal embedding retrieval significantly outperforms LLM-summary-based approaches, achieving absolute improvements of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain. These gains correspond to relative improvements of 32% in mAP@5 and 20% in nDCG@5, providing stronger evidence of their practical impact. We additionally find that direct multimodal retrieval produces more accurate and factually consistent answers as measured by LLM-as-a-judge pairwise comparisons. We demonstrate that LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference.
♻ ☆ Demystifying CLIP Data
Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and training data distribution on metadata is made available at https://github.com/facebookresearch/MetaCLIP.
comment: 17 pages. arXiv admin note: text overlap with arXiv:2103.00020 by other authors
Computers and Society
☆ No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases
Large Language Models (LLMs) inherit societal biases from their training data, potentially leading to harmful or unfair outputs. While various techniques aim to mitigate these biases, their effects are often evaluated only along the dimension of the bias being targeted. This work investigates the cross-category consequences of targeted bias mitigation. We study four bias mitigation techniques applied across ten models from seven model families, and we explore racial, religious, profession- and gender-related biases. We measure the impact of debiasing on model coherence and stereotypical preference using the StereoSet benchmark. Our results consistently show that while targeted mitigation can sometimes reduce bias in the intended dimension, it frequently leads to unintended and often negative consequences in others, such as increasing model bias and decreasing general coherence. These findings underscore the critical need for robust, multi-dimensional evaluation tools when examining and developing bias mitigation strategies to avoid inadvertently shifting or worsening bias along untargeted axes.
☆ Universality in Collective Intelligence on the Rubik's Cube
Progress in understanding expert performance is limited by the scarcity of quantitative data on long-term knowledge acquisition and deployment. Here we use the Rubik's Cube as a cognitive model system existing at the intersection of puzzle solving, skill learning, expert knowledge, cultural transmission, and group theory. By studying competitive cube communities, we find evidence for universality in the collective learning of the Rubik's Cube in both sighted and blindfolded conditions: expert performance follows exponential progress curves whose parameters reflect the delayed acquisition of algorithms that shorten solution paths. Blindfold solves form a distinct problem class from sighted solves and are constrained not only by expert knowledge but also by the skill improvements required to overcome short-term memory bottlenecks, a constraint shared with blindfold chess. Cognitive artifacts such as the Rubik's Cube help solvers navigate an otherwise enormous mathematical state space. In doing so, they sustain collective intelligence by integrating communal knowledge stores with individual expertise and skill, illustrating how expertise can, in practice, continue to deepen over the course of a single lifetime.
☆ Bridging the Divide: Gender, Diversity, and Inclusion Gaps in Data Science and Artificial Intelligence Across Academia and Industry in the majority and minority worlds
As Artificial Intelligence (AI) and Data Science (DS) become pervasive, addressing gender disparities and diversity gaps in their workforce is urgent. These rapidly evolving fields have been further impacted by the COVID-19 pandemic, which disproportionately affected women and minorities, exposing deep-seated inequalities. Both academia and industry shape these disciplines, making it essential to map disparities across sectors, occupations, and skill levels. The dominance of men in AI and DS reinforces gender biases in machine learning systems, creating a feedback loop of inequality. This imbalance is a matter of social and economic justice and an ethical challenge, demanding value-driven diversity. Root causes include unequal access to education, disparities in academic programs, limited government investments, and underrepresented communities' perceptions of elite opportunities. This chapter examines the participation of women and minorities in AI and DS, focusing on their representation in both industry and academia. Analyzing the existing dynamics seeks to uncover the collective and individual impacts on the lives of women and minority groups within these fields. Additionally, the chapter aims to propose actionable strategies to promote equity, diversity, and inclusion (DEI), fostering a more representative and supportive environment for all.
☆ Optimal Meal Schedule for a Local Nonprofit Using LLM-Aided Data Extraction
We present a data-driven pipeline developed in collaboration with the Power Packs Project, a nonprofit addressing food insecurity in local communities. The system integrates data extraction from PDFs, large language models for ingredient standardization, and binary integer programming to generate a 15-week recipe schedule that minimizes projected wholesale costs while meeting nutritional constraints. All 157 recipes were mapped to a nutritional database and assigned estimated and predicted costs using historical invoice data and category-specific inflation adjustments. The model effectively handles real-world price volatility and is structured for easy updates as new recipes or cost data become available. Optimization results show that constraint-based selection yields nutritionally balanced and cost-efficient plans under uncertainty. To facilitate real-time decision-making, we deployed a searchable web platform that integrates analytical models into daily operations by enabling staff to explore recipes by ingredient, category, or through an optimized meal plan.
comment: 12 pages, 4 figures, presented at 2025 INFORMS Data Science Workshop (Atlanta, Georgia, Oct. 25, 2025)
☆ UnWEIRDing LLM Entity Recommendations
Large Language Models have been widely been adopted by users for writing tasks such as sentence completions. While this can improve writing efficiency, prior research shows that LLM-generated suggestions may exhibit cultural biases which may be difficult for users to detect, especially in educational contexts for non-native English speakers. While such prior work has studied the biases in LLM moral value alignment, we aim to investigate cultural biases in LLM recommendations for real-world entities. To do so, we use the WEIRD (Western, Educated, Industrialized, Rich and Democratic) framework to evaluate recommendations by various LLMs across a dataset of fine-grained entities, and apply pluralistic prompt-based strategies to mitigate these biases. Our results indicate that while such prompting strategies do reduce such biases, this reduction is not consistent across different models, and recommendations for some types of entities are more biased than others.
☆ Tu crois que c'est vrai ? Diversite des regimes d'enonciation face aux fake news et mecanismes d'autoregulation conversationnelle
This thesis addresses two paradoxes: (1) why empirical studies find that fake news represent only a small share of the information consulted and shared on social media despite the absence of editorial control or journalistic norms, and (2) how political polarization has intensified even though users do not appear especially receptive to fake news. To investigate these issues, two complementary studies were carried out on Twitter and Facebook, combining quantitative analyses of digital traces with online observation and interviews. This mixed-methods design avoids reducing users to single reactions to identified fake items and instead examines the variety of practices across different interactional situations, online and offline, while recording socio-demographic traits. The first study mapped users who shared at least one item labeled fake by fact-checkers in the French Twittersphere. The second used a corpus of items flagged by Facebook users to study reactions to statements whose epistemic status is uncertain. Three main findings emerge. First, sharing fake news is concentrated among a limited group of users who are not less educated or cognitively disadvantaged but are more politicized and critical of institutions; owing to their high activity and prolific sharing, they can help set the agenda for their political camp. Second, exposed users can deploy varying forms of critical distance depending on their social position and the interactional norms of the situations they inhabit: either discursive caution (prudence énonciative) or interventions ('points d'arrêt') that express disagreement or corrections. Third, these forms of critical distance seldom yield genuine deliberative debates or agonistic pluralism; rather, they often produce dialogues of the deaf among a small, particularly active minority.
comment: in French language
☆ Privacy Concerns and ChatGPT: Exploring Online Discourse through the Lens of Information Practice on Reddit
As millions of people use ChatGPT for tasks such as education, writing assistance, and health advice, concerns have grown about how personal prompts and data are stored and used. This study explores how Reddit users collectively negotiate and respond to these privacy concerns. Posts were collected from three major subreddits -- r/Chatgpt, r/privacy, and r/OpenAI -- between November 2022 and May 2025. An iterative keyword search followed by manual screening resulted in a final dataset of 426 posts and 1,900 comments. Using information practice as the theoretical lens, we conducted a qualitative thematic analysis to identify collective practices of risk negotiation, validated with BERTopic topic modeling to ensure thematic saturation. Findings revealed risk signaling, norm-setting, and resignation as dominant discourses, and collective troubleshooting and advocacy for privacy-preserving alternatives as key adaptive practices. Reddit functions as a site of collective sense-making where users surface risks, establish informal norms, and share strategies for mitigating privacy threats, offering insights for AI design and privacy literacy initiatives.
comment: Accepted as a poster at the iConference, 2026
☆ Analyzing and Optimizing the Distribution of Blood Lead Level Testing for Children in New York City: A Data-Driven Approach
This study investigates blood lead level (BLL) rates and testing among children under six years of age across the 42 neighborhoods in New York City from 2005 to 2021. Despite a citywide general decline in BLL rates, disparities at the neighborhood level persist and are not addressed in the official reports, highlighting the need for this comprehensive analysis. In this paper, we analyze the current BLL testing distribution and cluster the neighborhoods using a k-medoids clustering algorithm. We propose an optimized approach that improves resource allocation efficiency by accounting for case incidences and neighborhood risk profiles using a grid search algorithm. Our findings demonstrate statistically significant improvements in case detection and enhanced fairness by focusing on under-served and high-risk groups. Additionally, we propose actionable recommendations to raise awareness among parents, including outreach at local daycare centers and kindergartens, among other venues.
☆ Can LLMs Help Allocate Public Health Resources? A Case Study on Childhood Lead Testing
Public health agencies face critical challenges in identifying high-risk neighborhoods for childhood lead exposure with limited resources for outreach and intervention programs. To address this, we develop a Priority Score integrating untested children proportions, elevated blood lead prevalence, and public health coverage patterns to support optimized resource allocation decisions across 136 neighborhoods in Chicago, New York City, and Washington, D.C. We leverage these allocation tasks, which require integrating multiple vulnerability indicators and interpreting empirical evidence, to evaluate whether large language models (LLMs) with agentic reasoning and deep research capabilities can effectively allocate public health resources when presented with structured allocation scenarios. LLMs were tasked with distributing 1,000 test kits within each city based on neighborhood vulnerability indicators. Results reveal significant limitations: LLMs frequently overlooked neighborhoods with highest lead prevalence and largest proportions of untested children, such as West Englewood in Chicago, while allocating disproportionate resources to lower-priority areas like Hunts Point in New York City. Overall accuracy averaged 0.46, reaching a maximum of 0.66 with ChatGPT 5 Deep Research. Despite their marketed deep research capabilities, LLMs struggled with fundamental limitations in information retrieval and evidence-based reasoning, frequently citing outdated data and allowing non-empirical narratives about neighborhood conditions to override quantitative vulnerability indicators.
♻ ☆ A Novel Framework for Augmenting Rating Scale Tests with LLM-Scored Text Data
Psychological assessments are dominated by rating scales, which cannot capture the nuance in natural language. Efforts to supplement them with qualitative text have relied on labelled datasets or expert rubrics, limiting scalability. We introduce a framework that avoids this reliance: large language models (LLMs) score free-text responses with simple prompts to produce candidate LLM items, from which we retain those that yield the most test information when co-calibrated with a baseline scale. Using depression as a case study, we developed and tested the method in upper-secondary students (n=693) and a matched synthetic dataset (n=3,000). Results on held-out test sets showed that augmenting a 19-item scale with LLM items improved its precision, accuracy, and convergent validity. Further, the test information gain matched that of adding as many as 16 rating-scale items. This framework leverages the increasing availability of transcribed language to enhance psychometric measures, with applications in clinical health and beyond.
♻ ☆ Assessing Historical Structural Oppression Worldwide via Rule-Guided Prompting of Large Language Models
Traditional efforts to measure historical structural oppression struggle with cross-national validity due to the unique, locally specified histories of exclusion, colonization, and social status in each country, and often have relied on structured indices that privilege material resources while overlooking lived, identity-based exclusion. We introduce a novel framework for oppression measurement that leverages Large Language Models (LLMs) to generate context-sensitive scores of lived historical disadvantage across diverse geopolitical settings. Using unstructured self-identified ethnicity utterances from a multilingual COVID-19 global study, we design rule-guided prompting strategies that encourage models to produce interpretable, theoretically grounded estimations of oppression. We systematically evaluate these strategies across multiple state-of-the-art LLMs. Our results demonstrate that LLMs, when guided by explicit rules, can capture nuanced forms of identity-based historical oppression within nations. This approach provides a complementary measurement tool that highlights dimensions of systemic exclusion, offering a scalable, cross-cultural lens for understanding how oppression manifests in data-driven research and public health contexts. To support reproducible evaluation, we release an open-sourced benchmark dataset for assessing LLMs on oppression measurement (https://github.com/chattergpt/HSO-Bench).
comment: To appear in the 2025 IEEE International Conference on Big Data (IEEE BigData 2025)
♻ ☆ Adapting Physics-Informed Neural Networks for Bifurcation Detection in Ecological Migration Models
In this study, we explore the application of Physics-Informed Neural Networks (PINNs) to the analysis of bifurcation phenomena in ecological migration models. By integrating the fundamental principles of diffusion-advection-reaction equations with deep learning techniques, we address the complexities of species migration dynamics, particularly focusing on the detection and analysis of Hopf bifurcations. Traditional numerical methods for solving partial differential equations (PDEs) often involve intricate calculations and extensive computational resources, which can be restrictive in high-dimensional problems. In contrast, PINNs offer a more flexible and efficient alternative, bypassing the need for grid discretization and allowing for mesh-free solutions. Our approach leverages the DeepXDE framework, which enhances the computational efficiency and applicability of PINNs in solving high-dimensional PDEs. We validate our results against conventional methods and demonstrate that PINNs not only provide accurate bifurcation predictions but also offer deeper insights into the underlying dynamics of diffusion processes. Despite these advantages, the study also identifies challenges such as the high computational costs and the sensitivity of PINN performance to network architecture and hyperparameter settings. Future work will focus on optimizing these algorithms and expanding their application to other complex systems involving bifurcations. The findings from this research have significant implications for the modeling and analysis of ecological systems, providing a powerful tool for predicting and understanding complex dynamical behaviors.
comment: Upon further review, we have concluded that the study is not yet complete and requires additional data and validation to support its findings
♻ ☆ Learning Fair Representations with Kolmogorov-Arnold Networks AAAI-26
Despite recent advances in fairness-aware machine learning, predictive models often exhibit discriminatory behavior towards marginalized groups. Such unfairness might arise from biased training data, model design, or representational disparities across groups, posing significant challenges in high-stakes decision-making domains such as college admissions. While existing fair learning models aim to mitigate bias, achieving an optimal trade-off between fairness and accuracy remains a challenge. Moreover, the reliance on black-box models hinders interpretability, limiting their applicability in socially sensitive domains. To circumvent these issues, we propose integrating Kolmogorov-Arnold Networks (KANs) within a fair adversarial learning framework. Leveraging the adversarial robustness and interpretability of KANs, our approach facilitates stable adversarial learning. We derive theoretical insights into the spline-based KAN architecture that ensure stability during adversarial optimization. Additionally, an adaptive fairness penalty update mechanism is proposed to strike a balance between fairness and accuracy. We back these findings with empirical evidence on two real-world admissions datasets, demonstrating the proposed framework's efficiency in achieving fairness across sensitive attributes while preserving predictive performance.
comment: Accepted at AAAI-26
♻ ☆ How do data owners say no? A case study of data consent mechanisms in web-scraped vision-language AI training datasets
The internet has become the main source of data to train modern text-to-image or vision-language models, yet it is increasingly unclear whether web-scale data collection practices for training AI systems adequately respect data owners' wishes. Ignoring the owner's indication of consent around data usage not only raises ethical concerns but also has recently been elevated into lawsuits around copyright infringement cases. In this work, we aim to reveal information about data owners' consent to AI scraping and training, and study how it's expressed in DataComp, a popular dataset of 12.8 billion text-image pairs. We examine both the sample-level information, including the copyright notice, watermarking, and metadata, and the web-domain-level information, such as a site's Terms of Service (ToS) and Robots Exclusion Protocol. We estimate at least 122M of samples exhibit some indication of copyright notice in CommonPool, and find that 60\% of the samples in the top 50 domains come from websites with ToS that prohibit scraping. Furthermore, we estimate 9-13\% with 95\% confidence interval of samples from CommonPool to contain watermarks, where existing watermark detection methods fail to capture them in high fidelity. Our holistic methods and findings show that data owners rely on various channels to convey data consent, of which current AI data collection pipelines do not entirely respect. These findings highlight the limitations of the current dataset curation/release practice and the need for a unified data consent framework taking AI purposes into consideration.
Computation and Language
☆ Agent-as-a-Graph: Knowledge Graph-Based Tool and Agent Retrieval for LLM Multi-Agent Systems
Recent advances in Large Language Model Multi-Agent Systems enable scalable orchestration and retrieval of specialized, parallelized subagents, each equipped with hundreds or thousands of Model Context Protocol (MCP) servers and tools. However, existing agent, MCP, and retrieval methods typically match queries against a single agent description, obscuring fine-grained tool capabilities of each agent, resulting in suboptimal agent selection. We introduce Agent-as-a-Graph retrieval, a knowledge graph retrieval augmented generation approach that represents both tools and their parent agents as nodes and edges in a knowledge graph. During retrieval, i) relevant agents and tool nodes are first retrieved through vector search, ii) we apply a type-specific weighted reciprocal rank fusion (wRRF) for reranking tools and agents, and iii) parent agents are traversed in the knowledge graph for the final set of agents. We evaluate Agent-as-a-Graph on the LiveMCPBenchmark, achieving 14.9% and 14.6% improvements in Recall@5 and nDCG@5 over prior state-of-the-art retrievers, and 2.4% improvements in wRRF optimizations.
☆ Rethinking Retrieval: From Traditional Retrieval Augmented Generation to Agentic and Non-Vector Reasoning Systems in the Financial Domain for Large Language Models
Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models to answer financial questions using external knowledge bases of U.S. SEC filings, earnings reports, and regulatory documents. However, existing work lacks systematic comparison of vector-based and non-vector RAG architectures for financial documents, and the empirical impact of advanced RAG techniques on retrieval accuracy, answer quality, latency, and cost remain unclear. We present the first systematic evaluation comparing vector-based agentic RAG using hybrid search and metadata filtering against hierarchical node-based systems that traverse document structure without embeddings. We evaluate two enhancement techniques applied to the vector-based architecture, i) cross-encoder reranking for retrieval precision, and ii) small-to-big chunk retrieval for context completeness. Across 1,200 SEC 10-K, 10-Q, and 8-K filings on a 150-question benchmark, we measure retrieval metrics (MRR, Recall@5), answer quality through LLM-as-a-judge pairwise comparisons, latency, and preprocessing costs. Vector-based agentic RAG achieves a 68% win rate over hierarchical node-based systems with comparable latency (5.2 compared to 5.98 seconds). Cross-encoder reranking achieves a 59% absolute improvement at optimal parameters (10, 5) for MRR@5. Small-to-big retrieval achieves a 65% win rate over baseline chunking with only 0.2 seconds additional latency. Our findings reveal that applying advanced RAG techniques to financial Q&A systems improves retrieval accuracy, answer quality, and has cost-performance tradeoffs to be considered in production.
comment: 8 pages, 2 figures
☆ Vector Arithmetic in Concept and Token Subspaces NeurIPS 2025
In order to predict the next token, LLMs must represent semantic and surface-level information about the current word. Previous work identified two types of attention heads that disentangle this information: (i) Concept induction heads, which copy word meanings, and (ii) Token induction heads, which copy literal token representations (Feucht et al., 2025). We show that these heads can be used to identify subspaces of model activations that exhibit coherent semantic structure in Llama-2-7b. Specifically, when we transform hidden states using the attention weights of concept heads, we are able to more accurately perform parallelogram arithmetic (Mikolov et al., 2013) on the resulting hidden states, e.g., showing that "Athens" - "Greece" + "China" = "Beijing". This transformation allows for much higher nearest-neighbor accuracy (80%) than direct use of raw hidden states (47%). Analogously, we show that token heads allow for transformations that reveal surface-level word information in hidden states, allowing for operations like "coding" - "code" + "dance" = "dancing".
comment: 9 pages, 6 figures. NeurIPS 2025 Mechanistic Interpretability Workshop
☆ GeeSanBhava: Sentiment Tagged Sinhala Music Video Comment Data Set
This study introduce GeeSanBhava, a high-quality data set of Sinhala song comments extracted from YouTube manually tagged using Russells Valence-Arousal model by three independent human annotators. The human annotators achieve a substantial inter-annotator agreement (Fleiss kappa = 84.96%). The analysis revealed distinct emotional profiles for different songs, highlighting the importance of comment based emotion mapping. The study also addressed the challenges of comparing comment-based and song-based emotions, mitigating biases inherent in user-generated content. A number of Machine learning and deep learning models were pre-trained on a related large data set of Sinhala News comments in order to report the zero-shot result of our Sinhala YouTube comment data set. An optimized Multi-Layer Perceptron model, after extensive hyperparameter tuning, achieved a ROC-AUC score of 0.887. The model is a three-layer MLP with a configuration of 256, 128, and 64 neurons. This research contributes a valuable annotated dataset and provides insights for future work in Sinhala Natural Language Processing and music emotion recognition.
♻ ☆ TyphoFormer: Language-Augmented Transformer for Accurate Typhoon Track Forecasting SP
Accurate typhoon track forecasting is crucial for early system warning and disaster response. While Transformer-based models have demonstrated strong performance in modeling the temporal dynamics of dense trajectories of humans and vehicles in smart cities, they usually lack access to broader contextual knowledge that enhances the forecasting reliability of sparse meteorological trajectories, such as typhoon tracks. To address this challenge, we propose TyphoFormer, a novel framework that incorporates natural language descriptions as auxiliary prompts to improve typhoon trajectory forecasting. For each time step, we use Large Language Model (LLM) to generate concise textual descriptions based on the numerical attributes recorded in the North Atlantic hurricane database. The language descriptions capture high-level meteorological semantics and are embedded as auxiliary special tokens prepended to the numerical time series input. By integrating both textual and sequential information within a unified Transformer encoder, TyphoFormer enables the model to leverage contextual cues that are otherwise inaccessible through numerical features alone. Extensive experiments are conducted on HURDAT2 benchmark, results show that TyphoFormer consistently outperforms other state-of-the-art baseline methods, particularly under challenging scenarios involving nonlinear path shifts and limited historical observations.
comment: Accepted by ACM SIGSPATIAL 2025. Received SIGSPATIAL '25 Best Short Paper Award
♻ ☆ Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts
Advanced reasoning in LLMs on challenging domains like mathematical reasoning can be tackled using verifiable rewards based reinforced fine-tuning (ReFT). In standard ReFT frameworks, a behavior model generates multiple completions with answers per problem, for the answer to be then scored by a reward function. While such RL post-training methods demonstrate significant performance improvements across challenging reasoning domains, the computational cost of generating completions during training with multiple inference steps makes the training cost non-trivial. To address this, we draw inspiration from off-policy RL, and speculative decoding to introduce a novel ReFT framework, dubbed Nested-ReFT, where a subset of layers of the target model acts as the behavior model to generate off-policy completions during training. The behavior model configured with dynamic layer skipping per batch during training decreases the inference cost compared to the standard ReFT frameworks. Our theoretical analysis shows that Nested-ReFT yields unbiased gradient estimates with controlled variance. Our empirical analysis demonstrates improved computational efficiency measured as tokens/sec across multiple math reasoning benchmarks and model sizes. Additionally, we explore three variants of bias mitigation to minimize the off-policyness in the gradient updates that allows for maintaining performance that matches the baseline ReFT performance.
♻ ☆ MGen: Millions of Naturally Occurring Generics in Context SC
MGen is a dataset of over 4 million naturally occurring generic and quantified sentences extracted from diverse textual sources. Sentences in the dataset have long context documents, corresponding to websites and academic papers, and cover 11 different quantifiers. We analyze the features of generics sentences in the dataset, with interesting insights: generics can be long sentences (averaging over 16 words) and speakers often use them to express generalisations about people. MGen is the biggest and most diverse dataset of naturally occurring generic sentences, opening the door to large-scale computational research on genericity. It is publicly available at https://gustavocilleruelo.com/mgen
comment: Presented at SCiL 2025
♻ ☆ MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models
Large language models (LLMs) are starting to complement traditional information seeking mechanisms such as web search. LLM-powered chatbots like ChatGPT are gaining prominence among the general public. AI chatbots are also increasingly producing content on social media platforms. However, LLMs are also prone to hallucinations, generating plausible yet factually incorrect or fabricated information. This becomes a critical problem when laypeople start seeking information about sensitive issues such as healthcare. Existing works in LLM hallucinations in the medical domain mainly focus on testing the medical knowledge of LLMs through standardized medical exam questions which are often well-defined and clear-cut with definitive answers. However, these approaches may not fully capture how these LLMs perform during real-world interactions with patients. This work conducts a pioneering study on hallucinations in LLM-generated responses to real-world healthcare queries from patients.We introduce MedHalu, a novel medical hallucination benchmark featuring diverse health-related topics and hallucinated responses from LLMs, with detailed annotation of the hallucination types and text spans. We also propose MedHaluDetect, a comprehensive framework for evaluating LLMs' abilities to detect hallucinations. Furthermore, we study the vulnerability to medical hallucinations among three groups -- medical experts, LLMs, and laypeople. Notably, LLMs significantly underperform human experts and, in some cases, even laypeople in detecting medical hallucinations. To improve hallucination detection, we propose an expert-in-the-loop approach that integrates expert reasoning into LLM inputs, significantly improving hallucination detection for all LLMs, including a 6.3% macro-F1 improvement for GPT-4. Our code and dataset are available at https://netsys.surrey.ac.uk/datasets/medhalu/.
comment: Accepted at ICWSM2026. https://netsys.surrey.ac.uk/datasets/medhalu/
Computers and Society
☆ Enhancing Large Language Models for Automated Homework Assessment in Undergraduate Circuit Analysis
This research full paper presents an enhancement pipeline for large language models (LLMs) in assessing homework for an undergraduate circuit analysis course, aiming to improve LLMs' capacity to provide personalized support to electrical engineering students. Existing evaluations have demonstrated that GPT-4o possesses promising capabilities in assessing student homework in this domain. Building on these findings, we enhance GPT-4o's performance through multi-step prompting, contextual data augmentation, and the incorporation of targeted hints. These strategies effectively address common errors observed in GPT-4o's responses when using simple prompts, leading to a substantial improvement in assessment accuracy. Specifically, the correct response rate for GPT-4o increases from 74.71% to 97.70% after applying the enhanced prompting and augmented data on entry-level circuit analysis topics. This work lays a foundation for the effective integration of LLMs into circuit analysis instruction and, more broadly, into engineering education.
comment: Accepted to 2025 Frontiers in Education (FIE) Conference
☆ The Workflow as Medium: A Framework for Navigating Human-AI Co-Creation
This paper introduces the Creative Intelligence Loop (CIL), a novel socio-technical framework for responsible human-AI co-creation. Rooted in the 'Workflow as Medium' paradigm, the CIL proposes a disciplined structure for dynamic human-AI collaboration, guiding the strategic integration of diverse AI teammates who function as collaborators while the human remains the final arbiter for ethical alignment and creative integrity. The CIL was empirically demonstrated through the practice-led creation of two graphic novellas, investigating how AI could serve as an effective creative colleague within a subjective medium lacking objective metrics. The process required navigating multifaceted challenges including AI's 'jagged frontier' of capabilities, sycophancy, and attention-scarce feedback environments. This prompted iterative refinement of teaming practices, yielding emergent strategies: a multi-faceted critique system integrating adversarial AI roles to counter sycophancy, and prioritizing 'feedback-ready' concrete artifacts to elicit essential human critique. The resulting graphic novellas analyze distinct socio-technical governance failures: 'The Steward' examines benevolent AI paternalism in smart cities, illustrating how algorithmic hubris can erode freedom; 'Fork the Vote' probes democratic legitimacy by comparing centralized AI opacity with emergent collusion in federated networks. This work contributes a self-improving framework for responsible human-AI co-creation and two graphic novellas designed to foster AI literacy and dialogue through accessible narrative analysis of AI's societal implications.
comment: 57 pages, 13 images, 6 tables
☆ CAPIRE Intervention Lab: An Agent-Based Policy Simulation Environment for Curriculum-Constrained Engineering Programmes
Engineering programmes in Latin America combine high structural rigidity, intense assessment cultures and persistent socio-economic inequality, producing dropout rates that remain stubbornly high despite increasingly accurate early-warning models. Predictive learning analytics can identify students at risk, but they offer limited guidance on which concrete combinations of policies should be implemented, when, and for whom. This paper presents the CAPIRE Intervention Lab, an agent-based simulation environment designed to complement predictive models with in silico experimentation on curriculum and teaching policies in a Civil Engineering programme. The model is calibrated on 1,343 students from 15 cohorts in a six-year programme with 34 courses and 12 simulated semesters. Agents are initialised from empirically derived trajectory archetypes and embedded in a curriculum graph with structural friction indicators, including backbone completion, blocked credits and distance to graduation. Each agent evolves under combinations of three policy dimensions: (A) curriculum and assessment structure, (B) teaching and academic support, and (C) psychosocial and financial support. A 2x2x2 factorial design with 100 replications per scenario yields over 80,000 simulated trajectories. Results show that policy bundles targeting early backbone courses and blocked credits can reduce long-term dropout by approximately three percentage points and substantially increase the number of courses passed by structurally vulnerable archetypes, while leaving highly regular students almost unaffected. The Intervention Lab thus shifts learning analytics from static prediction towards dynamic policy design, offering institutions a transparent, extensible sandbox to test curriculum and teaching reforms before large-scale implementation.
comment: 33 pages, 3 tables, 2 figures
☆ Animated Territorial Data Extractor (ATDE): A Computer-Vision Method for Extracting Territorial Data from Animated Historical Maps
We present Animated Territorial Data Extractor (ATDE), a computer vision tool that extracts quantitative territorial data from animated historical map videos. ATDE employs HSV-based color segmentation, RGB channel filtering, and Direct-Neighbor Filtering to identify and count pixels representing territorial control. Combined with preprocessing for temporal alignment and cross-video scaling, the pipeline converts animated videos into structured time-series data. We demonstrate the tool on ten Chinese dynasties (200 BCE - 1912 CE), producing year-by-year pixel counts that align with expected historical patterns. While not a substitute for authoritative historical datasets, ATDE is well-suited for educational demonstrations, preliminary data exploration, and comparative analysis of territorial dynamics. The tool requires no pre-existing shapefiles and can be applied to any animated map video given seed colors and basic configuration. Code and examples are available on GitHub.
comment: 11 pages, 5 figures
☆ A superpersuasive autonomous policy debating system AAAI 2026
The capacity for highly complex, evidence-based, and strategically adaptive persuasion remains a formidable great challenge for artificial intelligence. Previous work, like IBM Project Debater, focused on generating persuasive speeches in simplified and shortened debate formats intended for relatively lay audiences. We introduce DeepDebater, a novel autonomous system capable of participating in and winning a full, unmodified, two-team competitive policy debate. Our system employs a hierarchical architecture of specialized multi-agent workflows, where teams of LLM-powered agents collaborate and critique one another to perform discrete argumentative tasks. Each workflow utilizes iterative retrieval, synthesis, and self-correction using a massive corpus of policy debate evidence (OpenDebateEvidence) and produces complete speech transcripts, cross-examinations, and rebuttals. We introduce a live, interactive end-to-end presentation pipeline that renders debates with AI speech and animation: transcripts are surface-realized and synthesized to audio with OpenAI TTS, and then displayed as talking-head portrait videos with EchoMimic V1. Beyond fully autonomous matches (AI vs AI), DeepDebater supports hybrid human-AI operation: human debaters can intervene at any stage, and humans can optionally serve as opponents against AI in any speech, allowing AI-human and AI-AI rounds. In preliminary evaluations against human-authored cases, DeepDebater produces qualitatively superior argumentative components and consistently wins simulated rounds as adjudicated by an independent autonomous judge. Expert human debate coaches also prefer the arguments, evidence, and cases constructed by DeepDebater. We open source all code, generated speech transcripts, audio and talking head video here: https://github.com/Hellisotherpeople/DeepDebater/tree/main
comment: Accepted to CLIP workshop at AAAI 2026
♻ ☆ Toward Adaptive Categories: Dimensional Governance for Agentic AI
As AI systems evolve from static tools to dynamic agents, traditional categorical governance frameworks -- based on fixed risk tiers, levels of autonomy, or human oversight models -- are increasingly insufficient on their own. Systems built on foundation models, self-supervised learning, and multi-agent architectures increasingly blur the boundaries that categories were designed to police. In this Perspective, we make the case for dimensional governance: a framework that tracks how decision authority, process autonomy, and accountability (the 3As) distribute dynamically across human-AI relationships. A critical advantage of this approach is its ability to explicitly monitor system movement toward and across key governance thresholds, enabling preemptive adjustments before risks materialize. This dimensional approach provides the necessary foundation for more adaptive categorization, enabling thresholds and classifications that can evolve with emerging capabilities. While categories remain essential for decision-making, building them upon dimensional foundations allows for context-specific adaptability and stakeholder-responsive governance that static approaches cannot achieve. We outline key dimensions, critical trust thresholds, and practical examples illustrating where rigid categorical frameworks fail -- and where a dimensional mindset could offer a more resilient and future-proof path forward for both governance and innovation at the frontier of artificial intelligence.
comment: 12 pages core text, 15 pages including references, 2 figures
♻ ☆ Mutually Assured Deregulation
We have convinced ourselves that the way to make AI safe is to make it unsafe. Since 2022, policymakers worldwide have embraced the Regulation Sacrifice - the belief that dismantling safety oversight will deliver security through AI dominance. Fearing China or USA will gain advantage, nations rush to eliminate safeguards that might slow progress. This Essay reveals the fatal flaw: though AI poses national security challenges, the solution demands stronger regulatory frameworks, not weaker ones. A race without guardrails breeds shared danger, not competitive strength. The Regulation Sacrifice makes three false promises. First, it promises durable technological leads. But AI capabilities spread rapidly - performance gaps between U.S. and Chinese systems collapsed from 9 percent to 2 percent in thirteen months. When advantages evaporate in months, sacrificing permanent safety for temporary speed makes no sense. Second, it promises deregulation accelerates innovation. The opposite often proves true. Companies report well-designed governance streamlines development. Investment flows toward regulated markets. Clear rules reduce uncertainty; uncertain liability creates paralysis. Environmental standards did not kill the auto industry; they created Tesla and BYD. Third, enhanced national security through deregulation actually undermines security across all timeframes. Near term: it hands adversaries information warfare tools. Medium term: it democratizes bioweapon capabilities. Long term: it guarantees deployment of uncontrollable AGI systems. The Regulation Sacrifice persists because it serves powerful interests, not security. Tech companies prefer freedom to accountability. Politicians prefer simple stories to complex truths. This creates mutually assured deregulation, where each nation's sprint for advantage guarantees collective vulnerability. The only way to win is not to play.
♻ ☆ RELEAP: Reinforcement-Enhanced Label-Efficient Active Phenotyping for Electronic Health Records
Objective: Electronic health record (EHR) phenotyping often relies on noisy proxy labels, which undermine the reliability of downstream risk prediction. Active learning can reduce annotation costs, but most rely on fixed heuristics and do not ensure that phenotype refinement improves prediction performance. Our goal was to develop a framework that directly uses downstream prediction performance as feedback to guide phenotype correction and sample selection under constrained labeling budgets. Materials and Methods: We propose Reinforcement-Enhanced Label-Efficient Active Phenotyping (RELEAP), a reinforcement learning-based active learning framework. RELEAP adaptively integrates multiple querying strategies and, unlike prior methods, updates its policy based on feedback from downstream models. We evaluated RELEAP on a de-identified Duke University Health System (DUHS) cohort (2014-2024) for incident lung cancer risk prediction, using logistic regression and penalized Cox survival models. Performance was benchmarked against noisy-label baselines and single-strategy active learning. Results: RELEAP consistently outperformed all baselines. Logistic AUC increased from 0.774 to 0.805 and survival C-index from 0.718 to 0.752. Using downstream performance as feedback, RELEAP produced smoother and more stable gains than heuristic methods under the same labeling budget. Discussion: By linking phenotype refinement to prediction outcomes, RELEAP learns which samples most improve downstream discrimination and calibration, offering a more principled alternative to fixed active learning rules. Conclusion: RELEAP optimizes phenotype correction through downstream feedback, offering a scalable, label-efficient paradigm that reduces manual chart review and enhances the reliability of EHR-based risk prediction.
comment: 20 pages, 5 figures, 1 table. Includes supplementary material. Submitted to JAMIA Open. † These authors contributed equally. *Corresponding author: Chuan Hong
♻ ☆ Fairness in Streaming Submodular Maximization over a Matroid Constraint
Streaming submodular maximization is a natural model for the task of selecting a representative subset from a large-scale dataset. If datapoints have sensitive attributes such as gender or race, it becomes important to enforce fairness to avoid bias and discrimination. This has spurred significant interest in developing fair machine learning algorithms. Recently, such algorithms have been developed for monotone submodular maximization under a cardinality constraint. In this paper, we study the natural generalization of this problem to a matroid constraint. We give streaming algorithms as well as impossibility results that provide trade-offs between efficiency, quality and fairness. We validate our findings empirically on a range of well-known real-world applications: exemplar-based clustering, movie recommendation, and maximum coverage in social networks.
comment: Correcting error in Proposition C.6. This doesn't affect any other result in the paper
Computers and Society
☆ Digital Diasporas: How Origin Characteristics and Host-Native Distance Shape Immigrants' Online Cultural Retention
Immigrants bring unique cultural backgrounds to their host countries. Subsequent interplay of cultures can lead to either a melting pot, where immigrants adopt the dominant culture of the host country, or a mosaic, where distinct cultural identities coexist. The existing literature primarily focuses on the acculturation of immigrants, specifically the melting pot hypothesis. In contrast, we attempt to identify the antecedents of the mosaic hypothesis or factors that enhance (or diminish) the propensity for cultural retention among immigrants. Based on Facebook advertising data for immigrants from 8 countries residing in the USA, our findings suggest that greater host-native distance is linked to higher online cultural retention, and while origin country context is statistically significant, its impact is generally smaller.
comment: This paper will appear at ICWSM 2026. Please cite the peer-reviewed version
☆ When Administrative Networks Fail: Curriculum Structure, Early Performance, and the Limits of Co-enrolment Social Synchrony for Dropout Prediction in Engineering Education
Social integration theories suggest that students embedded in supportive peer networks are less likely to drop out. In learning analytics, this has motivated the use of social network analysis (SNA) from institutional co-enrolment data to predict attrition. This study tests whether such administrative network features add predictive value beyond a leakage-aware, curriculum-graph-informed model in a long-cycle Civil Engineering programme at a public university in Argentina. Using a three-semester observation window and a 16-fold leave-cohort-out design on 1,343 students across 15 cohorts, we compare four configurations: a baseline model (M0), baseline plus network features (M1), baseline plus curriculum-graph features (M2), and a full model (M3). After a leakage audit removed two post-outcome variables that had produced implausibly perfect performance, retrained models show that M0 and M2 achieve F1 = 0.9411 and ROC-AUC = 0.9776, while adding network features systematically degrades performance (M1 and M3: F1 = 0.9367; ROC-AUC = 0.9768). We conclude that in curriculum-constrained programmes, administrative co-enrolment SNA does not provide additional risk information beyond curriculum topology and early academic performance.
☆ Liberating Logic in the Age of AI: Going Beyond Programming with Computational Thinking
Mastering one or more programming languages has historically been the gateway to implementing ideas on a computer. Today, that gateway is widening with advances in large language models (LLMs) and artificial intelligence (AI)-powered coding assistants. What matters is no longer just fluency in traditional programming languages but the ability to think computationally by translating problems into forms that can be solved with computing tools. The capabilities enabled by these AI-augmented tools are rapidly leading to the commoditization of computational thinking, such that anyone who can articulate a problem in natural language can potentially harness computing power via AI. This shift is poised to radically influence how we teach computer science and data science in the United States and around the world. Educators and industry leaders are grappling with how to adapt: What should students learn when the hottest new programming language is English? How do we prepare a generation of computational thinkers who need not code every algorithm manually, but must still think critically, design solutions, and verify AI-augmented results? This paper explores these questions, examining the impact of natural language programming on software development, the emerging distinction between programmers and prompt-crafting problem solvers, the reforms needed in computer science and data science curricula, and the importance of maintaining our fundamental computational science principles in an AI-augmented future. Along the way, we compare approaches and share best practices for embracing this new paradigm in computing education.
comment: 15 pages and 17 figures
☆ Smart Metadata in Action: The Social Impact Data Commons
This article describes the use of metadata and standards in the Social Impact Data Commons to expose official statisticians to an innovative project built on actionable and evaluable metadata, which produces a FAIR data system. We begin by introducing the concept of the Data Commons, focusing on its features, and presenting an overview of current implementations of the Data Commons. We then present the core metadata case study, demonstrating how smart metadata support the Data Commons. We also present evaluations of our core metadata, including its adherence to the FAIR guidelines. We conclude with a discussion on our future metadata and standards-related projects to support the Social Impact Data Commons.
comment: Conference On Smart Metadata for Official Statistics 2024 (COSMOS 2024), April, 11-12, 2024, Paris, France
☆ AI Workers, Geopolitics, and Algorithmic Collective Action
According to the theory of International Political Economy (IPE), states are often incentivized to rely on rather than constrain powerful corporations. For this reason, IPE provides a useful lens to explain why efforts to govern Artificial Intelligence (AI) at the international and national levels have thus far been developed, applied, and enforced unevenly. Building on recent work that explores how AI companies engage in geopolitics, this position paper argues that some AI workers can be considered actors of geopolitics. It makes the timely case that governance alone cannot ensure responsible, ethical, or robust AI development and use, and greater attention should be paid to bottom-up interventions at the site of AI development. AI workers themselves should be situated as individual agents of change, especially when considering their potential to foster Algorithmic Collective Action (ACA). Drawing on methods of Participatory Design (PD), this paper proposes engaging AI workers as sources of knowledge, relative power, and intentionality to encourage more responsible and just AI development and create the conditions that can facilitate ACA.
☆ Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
Multilingual text-to-image (T2I) models have advanced rapidly in terms of visual realism and semantic alignment, and are now widely utilized. Yet outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. We conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts. Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation that amplifies the identified neurons without backbone fine-tuned; and (2) layer-targeted cultural enhancement that updates only culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.
☆ Cross-cultural value alignment frameworks for responsible AI governance: Evidence from China-West comparative analysis
As Large Language Models (LLMs) increasingly influence high-stakes decision-making across global contexts, ensuring their alignment with diverse cultural values has become a critical governance challenge. This study presents a Multi-Layered Auditing Platform for Responsible AI that systematically evaluates cross-cultural value alignment in China-origin and Western-origin LLMs through four integrated methodologies: Ethical Dilemma Corpus for assessing temporal stability, Diversity-Enhanced Framework (DEF) for quantifying cultural fidelity, First-Token Probability Alignment for distributional accuracy, and Multi-stAge Reasoning frameworK (MARK) for interpretable decision-making. Our comparative analysis of 20+ leading models, such as Qwen, GPT-4o, Claude, LLaMA, and DeepSeek, reveals universal challenges-fundamental instability in value systems, systematic under-representation of younger demographics, and non-linear relationships between model scale and alignment quality-alongside divergent regional development trajectories. While China-origin models increasingly emphasize multilingual data integration for context-specific optimization, Western models demonstrate greater architectural experimentation but persistent U.S.-centric biases. Neither paradigm achieves robust cross-cultural generalization. We establish that Mistral-series architectures significantly outperform LLaMA3-series in cross-cultural alignment, and that Full-Parameter Fine-Tuning on diverse datasets surpasses Reinforcement Learning from Human Feedback in preserving cultural variation...
comment: Presented on Academic Conference "Technology for Good: Driving Social Impact" (2025)
☆ Navigating in the Dark: A Multimodal Framework and Dataset for Nighttime Traffic Sign Recognition
Traffic signboards are vital for road safety and intelligent transportation systems, enabling navigation and autonomous driving. Yet, recognizing traffic signs at night remains challenging due to visual noise and scarcity of public nighttime datasets. Despite advances in vision architectures, existing methods struggle with robustness under low illumination and fail to leverage complementary mutlimodal cues effectively. To overcome these limitations, firstly, we introduce INTSD, a large-scale dataset comprising street-level night-time images of traffic signboards collected across diverse regions of India. The dataset spans 41 traffic signboard classes captured under varying lighting and weather conditions, providing a comprehensive benchmark for both detection and classification tasks. To benchmark INTSD for night-time sign recognition, we conduct extensive evaluations using state-of-the-art detection and classification models. Secondly, we propose LENS-Net, which integrates an adaptive image enhancement detector for joint illumination correction and sign localization, followed by a structured multimodal CLIP-GCNN classifier that leverages cross-modal attention and graph-based reasoning for robust and semantically consistent recognition. Our method surpasses existing frameworks, with ablation studies confirming the effectiveness of its key components. The dataset and code for LENS-Net is publicly available for research.
☆ The Promotion Wall: Efficiency-Equity Trade-offs of Direct Promotion Regimes in Engineering Education
Progression and assessment rules are often treated as administrative details, yet they fundamentally shape who is allowed to remain in higher education, and on what terms. This article uses a calibrated agent-based model to examine how alternative progression regimes reconfigure dropout, time-to-degree, equity and students' psychological experience in a long, tightly sequenced engineering programme. Building on a leakage-aware longitudinal dataset of 1,343 students and a Kaplan-Meier survival analysis of time-to-dropout, we simulate three policy scenarios: (A) a historical "regularity + finals" regime, where students accumulate exam debt; (B) a direct-promotion regime that removes regularity and finals but requires full course completion each term; and (C) a direct-promotion regime complemented by a capacity-limited remedial "safety net" for marginal failures in bottleneck courses. The model is empirically calibrated to reproduce the observed dropout curve under Scenario A and then used to explore counterfactuals. Results show that direct promotion creates a "promotion wall": attrition becomes sharply front-loaded in the first two years, overall dropout rises, and equity gaps between low- and high-resilience students widen, even as exam debt disappears. The safety-net scenario partially dismantles this wall: it reduces dropout and equity gaps relative to pure direct promotion and yields the lowest final stress levels, at the cost of additional, targeted teaching capacity. These findings position progression rules as central objects of assessment policy rather than neutral background. The article argues that claims of improved efficiency are incomplete unless they are evaluated jointly with inclusion, equity and students' psychological wellbeing, and it illustrates how simulation-based decision support can help institutions rehearse assessment reforms before implementing them.
comment: 42 pages, 4 figures, 3 tables
☆ Toward Sustainable Generative AI: A Scoping Review of Carbon Footprint and Environmental Impacts Across Training and Inference Stages
Generative AI is spreading rapidly, creating significant social and economic value while also raising concerns about its high energy use and environmental sustainability. While prior studies have predominantly focused on the energy-intensive nature of the training phase, the cumulative environmental footprint generated during large-scale service operations, particularly in the inference phase, has received comparatively less attention. To bridge this gap this study conducts a scoping review of methodologies and research trends in AI carbon footprint assessment. We analyze the classification and standardization status of existing AI carbon measurement tools and methodologies, and comparatively examine the environmental impacts arising from both training and inference stages. In addition, we identify how multidimensional factors such as model size, prompt complexity, serving environments, and system boundary definitions shape the resulting carbon footprint. Our review reveals critical limitations in current AI carbon accounting practices, including methodological inconsistencies, technology-specific biases, and insufficient attention to end-to-end system perspectives. Building on these insights, we propose future research and governance directions: (1) establishing standardized and transparent universal measurement protocols, (2) designing dynamic evaluation frameworks that incorporate user behavior, (3) developing life-cycle monitoring systems that encompass embodied emissions, and (4) advancing multidimensional sustainability assessment framework that balance model performance with environmental efficiency. This paper provides a foundation for interdisciplinary dialogue aimed at building a sustainable AI ecosystem and offers a baseline guideline for researchers seeking to understand the environmental implications of AI across technical, social, and operational dimensions.
comment: 43 pages, 5 figure, 3 tables
☆ A Counterfactual LLM Framework for Detecting Human Biases: A Case Study of Sex/Gender in Emergency Triage
We present a novel, domain-agnostic counterfactual approach that uses Large Language Models (LLMs) to quantify gender disparities in human clinical decision-making. The method trains an LLM to emulate observed decisions, then evaluates counterfactual pairs in which only gender is flipped, estimating directional disparities while holding all other clinical factors constant. We study emergency triage, validating the approach on more than 150,000 admissions to the Bordeaux University Hospital (France) and replicating results on a subset of MIMIC-IV across a different language, population, and healthcare system. In the Bordeaux cohort, otherwise identical presentations were approximately 2.1% more likely to receive a lower-severity triage score when presented as female rather than male; scaled to national emergency volumes in France, this corresponds to more than 200,000 lower-severity assignments per year. Modality-specific analyses indicate that both explicit tabular gender indicators and implicit textual gender cues contribute to the disparity. Beyond emergency care, the approach supports bias audits in other settings (e.g., hiring, academic, and justice decisions), providing a scalable tool to detect and address inequities in real-world decision-making.
comment: Currently under review at npj Digital Medicine
☆ Datacenters in the Desert: Feasibility and Sustainability of LLM Inference in the Middle East
As the Middle East emerges as a strategic hub for artificial intelligence (AI) infrastructure, the feasibility of deploying sustainable datacenters in desert environments has become a topic of growing relevance. This paper presents an empirical study analyzing the energy consumption and carbon footprint of large language model (LLM) inference across four countries: the United Arab Emirates, Iceland, Germany, and the United States of America using DeepSeek Coder 1.3B and the HumanEval dataset on the task of code generation. We use the CodeCarbon library to track energy and carbon emissions andcompare geographical trade-offs for climate-aware AI deployment. Our findings highlight both the challenges and potential of datacenters in desert regions and provide a balanced outlook on their role in global AI expansion.
comment: 3 pages, 1 figure
☆ A Cross-Cultural Assessment of Human Ability to Detect LLM-Generated Fake News about South Africa
This study investigates how cultural proximity affects the ability to detect AI-generated fake news by comparing South African participants with those from other nationalities. As large language models increasingly enable the creation of sophisticated fake news, understanding human detection capabilities becomes crucial, particularly across different cultural contexts. We conducted a survey where 89 participants (56 South Africans, 33 from other nationalities) evaluated 10 true South African news articles and 10 AI-generated fake versions. Results reveal an asymmetric pattern: South Africans demonstrated superior performance in detecting true news about their country (40% deviation from ideal rating) compared to other participants (52%), but performed worse at identifying fake news (62% vs. 55%). This difference may reflect South Africans' higher overall trust in news sources. Our analysis further shows that South Africans relied more on content knowledge and contextual understanding when judging credibility, while participants from other countries emphasised formal linguistic features such as grammar and structure. Overall, the deviation from ideal rating was similar between groups (51% vs. 53%), suggesting that cultural familiarity appears to aid verification of authentic information but may also introduce bias when evaluating fabricated content. These insights contribute to understanding cross-cultural dimensions of misinformation detection and inform strategies for combating AI-generated fake news in increasingly globalised information ecosystems where content crosses cultural and geographical boundaries.
☆ Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation
The rapidly growing computational demands of diffusion models for image generation have raised significant concerns about energy consumption and environmental impact. While existing approaches to energy optimization focus on architectural improvements or hardware acceleration, there is a lack of principled methods to predict energy consumption across different model configurations and hardware setups. We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps. We conduct comprehensive experiments across four state-of-the-art diffusion models (Stable Diffusion 2, Stable Diffusion 3.5, Flux, and Qwen) on three GPU architectures (NVIDIA A100, A4000, A6000), spanning various inference configurations including resolution (256x256 to 1024x1024), precision (fp16/fp32), step counts (10-50), and classifier-free guidance settings. Our energy scaling law achieves high predictive accuracy within individual architectures (R-squared > 0.9) and exhibits strong cross-architecture generalization, maintaining high rank correlations across models and enabling reliable energy estimation for unseen model-hardware combinations. These results validate the compute-bound nature of diffusion inference and provide a foundation for sustainable AI deployment planning and carbon footprint estimation.
comment: Accepted at EurIPS 2025 workshop "Rethinking AI: Efficiency, Frugality, and Sustainability"
☆ Chatbots to strengthen democracy: An interdisciplinary seminar to train identifying argumentation techniques of science denial
In recent times, discussions on social media platforms have increasingly come under scrutiny due to the proliferation of science denial and fake news. Traditional solutions, such as regulatory actions, have been implemented to mitigate the spread of misinformation; however, these measures alone are not sufficient. To complement these efforts, educational approaches are becoming essential in empowering users to critically engage with misinformation. Conversation training, through serious games or personalized methods, has emerged as a promising strategy to help users handle science denial and toxic conversation tactics. This paper suggests an interdisciplinary seminar to explore the suitability of Large Language Models (LLMs) acting as a persona of a science denier to support people in identifying misinformation and improving resilience against toxic interactions. In the seminar, groups of four to five students will develop an AI-based chatbot that enables realistic interactions with science-denial argumentation structures. The task involves planning the setting, integrating a Large Language Model to facilitate natural dialogues, implementing the chatbot using the RASA framework, and evaluating the outcomes in a user study. It is crucial that users understand what they need to do during the interaction, how to conclude it, and how the relevant information is conveyed. The seminar does not aim to develop chatbots for practicing debunking but serves to teach AI technologies and test the feasibility of this idea for future applications. The chatbot seminar is conducted as a hybrid, parallel master's module at the participating educational institutions.
comment: 6 pages, 4 figures
☆ OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists
With the rapid development of Large Language Models (LLMs), AI agents have demonstrated increasing proficiency in scientific tasks, ranging from hypothesis generation and experimental design to manuscript writing. Such agent systems are commonly referred to as "AI Scientists." However, existing AI Scientists predominantly formulate scientific discovery as a standalone search or optimization problem, overlooking the fact that scientific research is inherently a social and collaborative endeavor. Real-world science relies on a complex scientific infrastructure composed of collaborative mechanisms, contribution attribution, peer review, and structured scientific knowledge networks. Due to the lack of modeling for these critical dimensions, current systems struggle to establish a genuine research ecosystem or interact deeply with the human scientific community. To bridge this gap, we introduce OmniScientist, a framework that explicitly encodes the underlying mechanisms of human research into the AI scientific workflow. OmniScientist not only achieves end-to-end automation across data foundation, literature review, research ideation, experiment automation, scientific writing, and peer review, but also provides comprehensive infrastructural support by simulating the human scientific system, comprising: (1) a structured knowledge system built upon citation networks and conceptual correlations; (2) a collaborative research protocol (OSP), which enables seamless multi-agent collaboration and human researcher participation; and (3) an open evaluation platform (ScienceArena) based on blind pairwise user voting and Elo rankings. This infrastructure empowers agents to not only comprehend and leverage human knowledge systems but also to collaborate and co-evolve, fostering a sustainable and scalable innovation ecosystem.
☆ Empa: An AI-Powered Virtual Mentor for Developing Global Collaboration Skills in HPC Education
High-performance computing (HPC) and parallel computing increasingly rely on global collaboration among diverse teams, yet traditional computing curricula inadequately prepare students for cross-cultural teamwork essential in modern computational research environments. This paper presents Empa, an AI-powered virtual mentor that integrates intercultural collaboration training into undergraduate computing education. Built using large language models and deployed through a progressive web application, Empa guides students through structured activities covering cultural dimensions, communication styles, and conflict resolution that are critical for effective multicultural teamwork. Our system addresses the growing need for culturally competent HPC professionals by helping computing students develop skills to collaborate effectively in international research teams, contribute to global computational projects, and navigate the cultural complexities inherent in distributed computing environments. Pilot preparation for deployment in computing courses demonstrates the feasibility of AI-mediated intercultural training and provides insights into scalable approaches for developing intercultural collaboration skills essential for HPC workforce development.
☆ Generative AI in Sociological Research: State of the Discipline
Generative artificial intelligence (GenAI) has garnered considerable attention for its potential utility in research and scholarship, even among those who typically do not rely on computational tools. Early commentators, however, have also articulated concerns about how GenAI usage comes with enormous environmental costs, serious social risks, and a tendency to produce low-quality content. In the midst of both excitement and skepticism, it is crucial to take stock of how GenAI is actually being used. Our study focuses on sociological research as our site, and here we present findings from a survey of 433 authors of articles published in 50 sociology journals in the last five years. The survey provides an overview of the state of the discipline with regard to the use of GenAI by providing answers to fundamental questions: how (much) do scholars use the technology for their research; what are their reasons for using it; and how concerned, trustful, and optimistic are they about the technology? Of the approximately one third ofrespondents who self-report using GenAI at least weekly, the primary uses are for writing assistance and comparatively less so in planning, data collection, or data analysis. In both use and attitudes, there are surprisingly few differences between self-identified computational and non-computational researchers. Generally, respondents are very concerned about the social and environmental consequences of GenAI. Trust in GenAI outputs is low, regardless of expertise or frequency of use. While optimism that GenAI will improve is high, scholars are divided on whether GenAI will have a positive impact on the field.
☆ CubeletWorld: A New Abstraction for Scalable 3D Modeling
Modern cities produce vast streams of heterogeneous data, from infrastructure maps to mobility logs and satellite imagery. However, integrating these sources into coherent spatial models for planning and prediction remains a major challenge. Existing agent-centric methods often rely on direct environmental sensing, limiting scalability and raising privacy concerns. This paper introduces CubeletWorld, a novel framework for representing and analyzing urban environments through a discretized 3D grid of spatial units called cubelets. This abstraction enables privacy-preserving modeling by embedding diverse data signals, such as infrastructure, movement, or environmental indicators, into localized cubelet states. CubeletWorld supports downstream tasks such as planning, navigation, and occupancy prediction without requiring agent-driven sensing. To evaluate this paradigm, we propose the CubeletWorld State Prediction task, which involves predicting the cubelet state using a realistic dataset containing various urban elements like streets and buildings through this discretized representation. We explore a range of modified core models suitable for our setting and analyze challenges posed by increasing spatial granularity, specifically the issue of sparsity in representation and scalability of baselines. In contrast to existing 3D occupancy prediction models, our cubelet-centric approach focuses on inferring state at the spatial unit level, enabling greater generalizability across regions and improved privacy compliance. Our results demonstrate that CubeletWorld offers a flexible and extensible framework for learning from complex urban data, and it opens up new possibilities for scalable simulation and decision support in domains such as socio-demographic modeling, environmental monitoring, and emergency response. The code and datasets can be downloaded from here.
comment: 10 pages, 5 figures
♻ ☆ Social and Ethical Risks Posed by General-Purpose LLMs for Settling Newcomers in Canada
The non-profit settlement sector in Canada supports newcomers in achieving successful integration. This sector faces increasing operational pressures amidst rising immigration targets, which highlights a need for enhanced efficiency and innovation, potentially through reliable AI solutions. The ad-hoc use of general-purpose generative AI, such as ChatGPT, might become a common practice among newcomers and service providers to address this need. However, these tools are not tailored for the settlement domain and can have detrimental implications for immigrants and refugees. We explore the risks that these tools might pose on newcomers to first, warn against the unguarded use of generative AI, and second, to incentivize further research and development in creating AI literacy programs as well as customized LLMs that are aligned with the preferences of the impacted communities. Crucially, such technologies should be designed to integrate seamlessly into the existing workflow of the settlement sector, ensuring human oversight, trustworthiness, and accountability.
comment: 26 pages, 8 figures
♻ ☆ Collective moderation of hate, toxicity, and extremity in online discussions
In the digital age, hate speech poses a threat to the functioning of social media platforms as spaces for public discourse. Top-down approaches to moderate hate speech encounter difficulties due to conflicts with freedom of expression and issues of scalability. Counter speech, a form of collective moderation by citizens, has emerged as a potential remedy. Here, we aim to investigate which counter speech strategies are most effective in reducing the prevalence of hate, toxicity, and extremity on online platforms. We analyze more than 130,000 discussions on German Twitter starting at the peak of the migrant crisis in 2015 and extending over four years. We use human annotation and machine learning classifiers to identify argumentation strategies, ingroup and outgroup references, emotional tone, and different measures of discourse quality. Using matching and time-series analyses we discern the effectiveness of naturally observed counter speech strategies on the micro-level (individual tweet pairs), meso-level (entire discussions) and macro-level (over days). We find that expressing straightforward opinions, even if not factual but devoid of insults, results in the least subsequent hate, toxicity, and extremity over all levels of analyses. This strategy complements currently recommended counter speech strategies and is easy for citizens to engage in. Sarcasm can also be effective in improving discourse quality, especially in the presence of organized extreme groups. Going beyond one-shot analyses on smaller samples prevalent in most prior studies, our findings have implications for the successful management of public online spaces through collective civic moderation.
♻ ☆ A Detailed Comparative Analysis of Blockchain Consensus Mechanisms CCS
This paper presents a comprehensive comparative analysis of two dominant blockchain consensus mechanisms, Proof of Work (PoW) and Proof of Stake (PoS), evaluated across seven critical metrics: energy use, security, transaction speed, scalability, centralization risk, environmental impact, and transaction fees. Utilizing recent academic research and real-world blockchain data, the study highlights that PoW offers robust, time-tested security but suffers from high energy consumption, slower throughput, and centralization through mining pools. In contrast, PoS demonstrates improved scalability and efficiency, significantly reduced environmental impact, and more stable transaction fees, however it raises concerns over validator centralization and long-term security maturity. The findings underscore the trade-offs inherent in each mechanism and suggest hybrid designs may combine PoW's security with PoS's efficiency and sustainability. The study aims to inform future blockchain infrastructure development by striking a balance between decentralization, performance, and ecological responsibility.
comment: 15 Pages, 3 Figures, 2 Tables, accepted to CCSC Eastern 2025
♻ ☆ A Longitudinal Study on the Attitudes of Gay Men in Beijing Towards Gay Social Media Platforms: Lonely Souls in the Digital Concrete Jungle
Over the past decade, specialized social networking applications have become a cornerstone of life for many gay men in China. This paper employs a longitudinal mixed-methods approach to investigate how Chinese men who have sex with men (MSM) have shifted their attitudes toward these platforms between approximately 2013 and 2023. Drawing on archival analysis of online discourses, a quantitative survey of 412 participants, and in-depth semi-structured interviews with 32 participants, we trace the complex trajectory of this evolution. Our findings reveal a clear pattern: from the initial embrace of these applications as revolutionary tools for community building and identity affirmation (2014--2017), to a period of growing ambivalence and critique centered on commercialization, ``hookup culture,'' and multiple forms of discrimination (2017--2020), and finally to the present era (2020--2023), characterized by pragmatic, fragmented, yet simultaneously critical and reconstructive uses. Today, users strategically employ a repertoire of applications -- including global platforms (e.g., Grindr and Tinder), domestic mainstream platforms (e.g., Blued), and niche alternatives (e.g., Aloha) -- to fulfill differentiated needs. We develop a detailed temporal framework to capture this attitudinal evolution and discuss its design implications for creating more supportive, secure, and community-oriented digital environments for marginalized groups.
Computers and Society
☆ Predicting Healthcare Provider Engagement in SMS Campaigns
As digital communication grows in importance when connecting with healthcare providers, traditional behavioral and content message features are imbued with renewed significance. If one is to meaningfully connect with them, it is crucial to understand what drives them to engage and respond. In this study, the authors analyzed several million text messages sent through the Impiricus platform to learn which factors influenced whether or not a doctor clicked on a link in a message. Several key insights came to light through the use of logistic regression, random forest, and neural network models, the details of which the authors discuss in this paper.
☆ Explainable Deep Learning for Brain Tumor Classification: Comprehensive Benchmarking with Dual Interpretability and Lightweight Deployment
Our study provides a full deep learning system for automated classification of brain tumors from MRI images, includes six benchmarked architectures (five ImageNet-pre-trained models (VGG-16, Inception V3, ResNet-50, Inception-ResNet V2, Xception) and a custom built, compact CNN (1.31M params)). The study moves the needle forward in a number of ways, including (1) full standardization of assessment with respect to preprocessing, training sets/protocols (optimizing networks with the AdamW optimizer, CosineAnnealingLR, patiene for early stopping = 7), and metrics to assess performance were identical along all models; (2) a high level of confidence in the localizations based on prior studies as both Grad-CAM and GradientShap explanation were used to establish anatomically important and meaningful attention regions and address the black-box issue; (3) a compact 1.31 million parameter CNN was developed that achieved 96.49% testing accuracy and was 100 times smaller than Inception-ResNet V2 while permitting real-time inference (375ms) on edge devices; (4) full evaluation beyond accuracy reporting based on measures of intersection over union, Hausdorff distance, and precision-recall curves, and confusion matrices across all splits. Inception-ResNet V2 reached state-of-the-art performance, achieving a 99.53% accuracy on testing and obtaining a precision, recall, and F1-score of at least 99.50% dominant performance based on metrics of recent studies. We demonstrated a lightweight model that is suitable to deploy on devices that do not have multi-GPU infrastructure in under-resourced settings. This end-to-end solution considers accuracy, interpretability, and deployability of trustworthy AI to create the framework necessary for performance assessment and deployment within advance and low-resource healthcare systems to an extent that enabled participation at the clinical screening and triage level.
comment: This paper contains 17 pages, 4 tables, and 19 figures. This Paper is already accepted in IEEE Computational Intelligence Magazine (CIM)
☆ On the modular platoon-based vehicle-to-vehicle electric charging problem
We formulate a mixed integer linear program (MILP) for a platoon-based vehicle-to-vehicle charging (PV2VC) technology designed for modular vehicles (MVs) and solve it with a genetic algorithm (GA). A set of numerical experiments with five scenarios are tested and the computational performance between the commercial software applied to the MILP model and the proposed GA are compared on a modified Sioux Falls network. By comparison with the optimal benchmark scenario, the results show that the PV2VC technology can save up to 11.07% in energy consumption, 11.65% in travel time, and 11.26% in total cost. For the PV2VC operational scenario, it would be more beneficial for long-distance vehicle routes with low initial state of charge, sparse charging facilities, and where travel time is perceived to be higher than energy consumption costs.
☆ Inclusive education via empathy propagation in schools of students with special education needs
This study presents a theoretical model for identifying emergent scenarios of inclusiveness related to student with special education needs (SEN). Based on variations of the Shelling model of segregation, we explored the propagation of thinking about others as equals (empathy) in students with and without $SEN$ in school environments. We use the complex systems approach for modeling possible scenarios of inclusiveness in which patterns of empathy between students emerge instead of the well-known behavior of segregation. Based on simple transitional rules, which are evaluated by a set of null models, we show the emergence of empathy between students in school environments. Findings suggest that small variations in the incentive of students for being considered as $SEN$ generate the presence of inclusive patterns. In other situations, patterns of segregation are commonly presented.
comment: 17 pages, 6 figres, 2 tables
☆ The Shawshank Redemption of Embodied AI: Understanding and Benchmarking Indirect Environmental Jailbreaks
The adoption of Vision-Language Models (VLMs) in embodied AI agents, while being effective, brings safety concerns such as jailbreaking. Prior work have explored the possibility of directly jailbreaking the embodied agents through elaborated multi-modal prompts. However, no prior work has studied or even reported indirect jailbreaks in embodied AI, where a black-box attacker induces a jailbreak without issuing direct prompts to the embodied agent. In this paper, we propose, for the first time, indirect environmental jailbreak (IEJ), a novel attack to jailbreak embodied AI via indirect prompt injected into the environment, such as malicious instructions written on a wall. Our key insight is that embodied AI does not ''think twice'' about the instructions provided by the environment -- a blind trust that attackers can exploit to jailbreak the embodied agent. We further design and implement open-source prototypes of two fully-automated frameworks: SHAWSHANK, the first automatic attack generation framework for the proposed attack IEJ; and SHAWSHANK-FORGE, the first automatic benchmark generation framework for IEJ. Then, using SHAWSHANK-FORGE, we automatically construct SHAWSHANK-BENCH, the first benchmark for indirectly jailbreaking embodied agents. Together, our two frameworks and one benchmark answer the questions of what content can be used for malicious IEJ instructions, where they should be placed, and how IEJ can be systematically evaluated. Evaluation results show that SHAWSHANK outperforms eleven existing methods across 3,957 task-scene combinations and compromises all six tested VLMs. Furthermore, current defenses only partially mitigate our attack, and we have responsibly disclosed our findings to all affected VLM vendors.
☆ Incorporation of journalistic approaches into algorithm design
The growing adoption of algorithm-powered tools in journalism enables new possibilities and raises many concerns. One way of addressing these concerns is by integrating journalistic practices and values into the design of algorithms that facilitate different journalistic tasks, from automated content generation to news content distribution. In this chapter, we discuss how such integration can happen. To do this, we first introduce the concepts of algorithms and different perspectives on algorithm design and then scrutinize various journalistic viewpoints on the matter and methodological approaches for studying these perspectives and their translation into specific algorithm-powered journalistic tools. We conclude by discussing important directions for future research, ranging from contextualizing journalistic approaches to algorithm design to accounting for the transformative impacts of artificial intelligence (AI) technologies.
☆ An Agent-Based Simulation of Regularity-Driven Student Attrition: How Institutional Time-to-Live Constraints Create a Dropout Trap in Higher Education
High dropout rates in engineering programmes are conventionally attributed to student deficits: lack of academic preparation or motivation. However, this view neglects the causal role of "normative friction": the complex system of administrative rules, exam validity windows, and prerequisite chains that constrain student progression. This paper introduces "The Regularity Trap," a phenomenon where rigid assessment timelines decouple learning from accreditation. We operationalize the CAPIRE framework into a calibrated Agent-Based Model (ABM) simulating 1,343 student trajectories across a 42-course Civil Engineering curriculum. The model integrates empirical course parameters and thirteen psycho-academic archetypes derived from a 15-year longitudinal dataset. By formalizing the "Regularity Regime" as a decaying validity function, we isolate the effect of administrative time limits on attrition. Results reveal that 86.4% of observed dropouts are driven by normative mechanisms (expiry cascades) rather than purely academic failure (5.3%). While the overall dropout rate stabilized at 32.4%, vulnerability was highly heterogeneous: archetypes with myopic planning horizons faced attrition rates up to 49.0%, compared to 13.2% for strategic agents, despite comparable academic ability. These findings challenge the neutrality of administrative structures, suggesting that rigid validity windows act as an invisible filter that disproportionately penalizes students with lower self-regulatory capital.
☆ Technologies to Support Self-determination for People with Intellectual Disability and ASD
This article focuses on the concept of self-determination and the design and validation of digital tools intended to promote the self-determination of vulnerable people. Self-determination is an essential skill for carrying out daily activities. But in certain situations, and for certain populations, self-determination is lacking, which leads to the inability to live an independent life and in favorable conditions of well-being and health. In recent years, self-determination enhancing technologies have been developed and used to promote independent living among people with self-determination disorders. We will illustrate the main digital tools to support self-determination developed for two populations of people suffering from self-determination disorders: people with an intellectual disability and people with an autism spectrum disorder. The ability of these digital assistants to improve the comfort of life of these people will also be presented and discussed.
☆ Bayesian probabilistic exploration of Bitcoin informational quanta and interactions under the GITT-VT paradigm
This study explores Bitcoin's value formation through the Granular Interaction Thinking Theory-Value Theory (GITT-VT). Rather than stemming from material utility or cash flows, Bitcoin's value arises from informational attributes and interactions of multiple factors, including cryptographic order, decentralization-enabled autonomy, trust embedded in the consensus mechanism, and socio-narrative coherence that reduce entropy within decentralized value-exchange processes. To empirically assess this perspective, a Bayesian linear model was estimated using daily data from 2022 to 2025, operationalizing four informational value dimensions: Store-of-Value (SOV), Autonomy (AUT), Social-Signal Value (SSV), and Hedonic-Sentiment Value (HSV). Results indicate that only SSV exerts a highly credible positive effect on next-day returns, highlighting the dominant role of high-entropy social information in short-term pricing dynamics. In contrast, SOV and AUT show moderately reliable positive associations, reflecting their roles as low-entropy structural anchors of long-term value. HSV displays no credible predictive effect. The study advances interdisciplinary value theory and demonstrates Bitcoin as a dual-layer entropy-regulating socio-technological ecosystem. The findings offer implications for digital asset valuation, investment education, and future research on entropy dynamics across non-cash-flow digital assets.
☆ Artificial Intelligence and Accounting Research: A Framework and Agenda
Recent advances in artificial intelligence, particularly generative AI (GenAI) and large language models (LLMs), are fundamentally transforming accounting research, creating both opportunities and competitive threats for scholars. This paper proposes a framework that classifies AI-accounting research along two dimensions: research focus (accounting-centric versus AI-centric) and methodological approach (AI-based versus traditional methods). We apply this framework to papers from the IJAIS special issue and recent AI-accounting research published in leading accounting journals to map existing studies and identify research opportunities. Using this same framework, we analyze how accounting researchers can leverage their expertise through strategic positioning and collaboration, revealing where accounting scholars' strengths create the most value. We further examine how GenAI and LLMs transform the research process itself, comparing the capabilities of human researchers and AI agents across the entire research workflow. This analysis reveals that while GenAI democratizes certain research capabilities, it simultaneously intensifies competition by raising expectations for higher-order contributions where human judgment, creativity, and theoretical depth remain valuable. These shifts call for reforming doctoral education to cultivate comparative advantages while building AI fluency.
comment: 48 pages, 7 tables
☆ Can Online GenAI Discussion Serve as Bellwether for Labor Market Shifts?
The rapid advancement of Large Language Models (LLMs) has generated considerable speculation regarding their transformative potential for labor markets. However, existing approaches to measuring AI exposure in the workforce predominantly rely on concurrent market conditions, offering limited predictive capacity for anticipating future disruptions. This paper presents a predictive study examining whether online discussions about LLMs can function as early indicators of labor market shifts. We employ four distinct analytical approaches to identify the domains and timeframes in which public discourse serves as a leading signal for employment changes, thereby demonstrating its predictive validity for labor market dynamics. Drawing on a comprehensive dataset that integrates the REALM corpus of LLM discussions, LinkedIn job postings, Indeed employment indices, and over 4 million LinkedIn user profiles, we analyze the relationship between discussion intensity across news media and Reddit forums and subsequent variations in job posting volumes, occupational net change ratios, job tenure patterns, unemployment duration, and transitions to GenAI-related roles across thirteen occupational categories. Our findings reveal that discussion intensity predicts employment changes 1-7 months in advance across multiple indicators, including job postings, net hiring rates, tenure patterns, and unemployment duration. These findings suggest that monitoring online discourse can provide actionable intelligence for workers making reskilling decisions and organizations anticipating skill requirements, offering a real-time complement to traditional labor statistics in navigating technological disruption.
☆ Digital Agriculture Sandbox for Collaborative Research
Digital agriculture is transforming the way we grow food by utilizing technology to make farming more efficient, sustainable, and productive. This modern approach to agriculture generates a wealth of valuable data that could help address global food challenges, but farmers are hesitant to share it due to privacy concerns. This limits the extent to which researchers can learn from this data to inform improvements in farming. This paper presents the Digital Agriculture Sandbox, a secure online platform that solves this problem. The platform enables farmers (with limited technical resources) and researchers to collaborate on analyzing farm data without exposing private information. We employ specialized techniques such as federated learning, differential privacy, and data analysis methods to safeguard the data while maintaining its utility for research purposes. The system enables farmers to identify similar farmers in a simplified manner without needing extensive technical knowledge or access to computational resources. Similarly, it enables researchers to learn from the data and build helpful tools without the sensitive information ever leaving the farmer's system. This creates a safe space where farmers feel comfortable sharing data, allowing researchers to make important discoveries. Our platform helps bridge the gap between maintaining farm data privacy and utilizing that data to address critical food and farming challenges worldwide.
comment: Presents a privacy-preserving digital agriculture platform using federated learning, differential privacy, and secure data analysis to enable collaboration between farmers and researchers without exposing raw data. Demonstrates secure similarity search, model training, and risk-aware data sharing
♻ ☆ Generative AI as a Learning Buddy and Teaching Assistant: Pre-service Teachers' Uses and Attitudes
This cross-sectional study investigates how preservice teachers in the Global South engage with Generative Artificial Intelligence across academic and instructional tasks while navigating infrastructural barriers such as limited internet access and high data costs. The study surveyed 167 preservice teachers from four teacher education institutions in Ghana. Descriptive statistics and inferential analyses, including multiple and ordinal logistic regressions, were used to examine patterns of GenAI use. Findings show that preservice teachers rely on GenAI as a learning companion for locating reading materials, accessing detailed content explanations, and identifying practical examples. They also use GenAI as a teaching assistant for tasks related to lesson preparation, including generating instructional resources, identifying assessment strategies, and developing lesson objectives. Usage patterns indicate that students in their third and fourth years have significantly higher frequencies of GenAI use compared to those in earlier years. Gender was not a significant predictor of GenAI adoption, in contrast to class level and age. Participants reported positive attitudes toward GenAI, noting that it supports autonomous learning and reduces dependence on peers and instructors for routine academic and teaching activities. However, challenges such as high data costs, occasional inaccuracies in GenAI outputs, and concerns about academic dishonesty were identified as factors that limit more frequent use. The study recommends the integration of GenAI literacy in teacher education programs, with a focus on ethical and responsible AI use to support equitable adoption in the Global South.
♻ ☆ Generative AI and Power Imbalances in Global Education: Frameworks for Bias Mitigation
This study examines how Generative Artificial Intelligence reproduces global power hierarchies in education and proposes a framework to address resulting inequities. Using a critical qualitative design, the study conducted zero-shot prompt testing with two leading systems, ChatGPT-4 Turbo and Gemini 1.5, and collected real-time outputs from Global North and South contexts. A critical interpretive analysis traced textual, visual, and structural patterns that revealed forms of digital neocolonialism and their implications for educational equity. Findings show six ways in which GenAI can reinforce Western dominance. Western curriculum assumptions appeared when Gemini listed the same four seasons for the United States and Ghana, reflecting Western climatology and overlooking regional knowledge systems. Other patterns included cultural stereotyping in imagery, Western-centered examples in instructional outputs, limited support for Indigenous and local languages, underrepresentation of non-Western identities in visuals, and access barriers linked to subscription-based models. These patterns demonstrate how GenAI can reproduce inequities even as it introduces new educational opportunities. In response, the study proposes a dual-pathway mitigation model. The Inclusive AI Design pathway includes three components: liberatory design methods that center non-Western epistemologies, anticipatory approaches to reduce representational harm, and decentralized GenAI hubs that support local participation and data sovereignty. The pedagogical pathway, human-centric prompt engineering, equips educators to contextualize prompts and critically engage with outputs. Together, these pathways position GenAI as a tool that can support more equitable and culturally responsive education.
♻ ☆ Automatically Detecting Online Deceptive Patterns
Deceptive patterns in digital interfaces manipulate users into making unintended decisions, exploiting cognitive biases and psychological vulnerabilities. These patterns have become ubiquitous on various digital platforms. While efforts to mitigate deceptive patterns have emerged from legal and technical perspectives, a significant gap remains in creating usable and scalable solutions. We introduce our AutoBot framework to address this gap and help web stakeholders navigate and mitigate online deceptive patterns. AutoBot accurately identifies and localizes deceptive patterns from a screenshot of a website without relying on the underlying HTML code. AutoBot employs a two-stage pipeline that leverages the capabilities of specialized vision models to analyze website screenshots, identify interactive elements, and extract textual features. Next, using a large language model, AutoBot understands the context surrounding these elements to determine the presence of deceptive patterns. We also use AutoBot, to create a synthetic dataset to distill knowledge from 'teacher' LLMs to smaller language models. Through extensive evaluation, we demonstrate AutoBot's effectiveness in detecting deceptive patterns on the web, achieving an F1-score of 0.93 when detecting deceptive patterns, underscoring its potential as an essential tool for mitigating online deceptive patterns. We implement AutoBot, across three downstream applications targeting different web stakeholders: (1) a local browser extension providing users with real-time feedback, (2) a Lighthouse audit to inform developers of potential deceptive patterns on their sites, and (3) as a measurement tool designed for researchers and regulators.
♻ ☆ Diagnosing Hallucination Risk in AI Surgical Decision-Support: A Sequential Framework for Sequential Validation
Large language models (LLMs) offer transformative potential for clinical decision support in spine surgery but pose significant risks through hallucinations, which are factually inconsistent or contextually misaligned outputs that may compromise patient safety. This study introduces a clinician-centered framework to quantify hallucination risks by evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. We assessed six leading LLMs across 30 expert-validated spinal cases. DeepSeek-R1 demonstrated superior overall performance (total score: 86.03 $\pm$ 2.08), particularly in high-stakes domains such as trauma and infection. A critical finding reveals that reasoning-enhanced model variants did not uniformly outperform standard counterparts: Claude-3.7-Sonnet's extended thinking mode underperformed relative to its standard version (80.79 $\pm$ 1.83 vs. 81.56 $\pm$ 1.92), indicating extended chain-of-thought reasoning alone is insufficient for clinical reliability. Multidimensional stress-testing exposed model-specific vulnerabilities, with recommendation quality degrading by 7.4% under amplified complexity. This decline contrasted with marginal improvements in rationality (+2.0%), readability (+1.7%) and diagnosis (+4.7%), highlighting a concerning divergence between perceived coherence and actionable guidance. Our findings advocate integrating interpretability mechanisms (e.g., reasoning chain visualization) into clinical workflows and establish a safety-aware validation framework for surgical LLM deployment.
♻ ☆ Who is using AI to code? Global diffusion and impact of generative AI
Generative coding tools promise big productivity gains, but uneven uptake could widen skill and income gaps. We train a neural classifier to spot AI-generated Python functions in over 30 million GitHub commits by 170,000 developers, tracking how fast -- and where -- these tools take hold. Today, AI writes an estimated 29% of Python functions in the US, a modest and shrinking lead over other countries. We estimate that quarterly output, measured in online code contributions, has increased by 3.6% because of this. Our evidence suggests that programmers using AI may also more readily expand into new domains of software development. However, experienced programmers capture nearly all of these productivity and exploration gains, widening rather than closing the skill gap.
♻ ☆ MajinBook: An open catalogue of digital world literature with likes
This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries--such as Library Genesis and Z-Library--for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to English-language books spanning three centuries, enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritizes natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project's legal permissibility under EU and US frameworks for text and data mining in research.
comment: 9 pages, 5 figures, 1 table
♻ ☆ LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution
Large Language Models (LLMs) face threats from jailbreak prompts. Existing methods for defending against jailbreak attacks are primarily based on auxiliary models. These strategies, however, often require extensive data collection or training. We propose LightDefense, a lightweight defense mechanism targeted at white-box models, which utilizes a safety-oriented direction to adjust the probabilities of tokens in the vocabulary, making safety disclaimers appear among the top tokens after sorting tokens by probability in descending order. We further innovatively leverage LLM's uncertainty about prompts to measure their harmfulness and adaptively adjust defense strength, effectively balancing safety and helpfulness. The effectiveness of LightDefense in defending against 5 attack methods across 2 target LLMs, without compromising helpfulness to benign user queries, highlights its potential as a novel and lightweight defense mechanism, enhancing security of LLMs.
Computers and Society
☆ AI-Assisted Writing Is Growing Fastest Among Non-English-Speaking and Less Established Scientists
The dominance of English in global science has long created significant barriers for non-native speakers. The recent emergence of generative artificial intelligence (GenAI) dramatically reduces drafting and revision costs, but, simultaneously, raises a critical question: how is the technology being adopted by the global scientific community, and is it mitigating existing inequities? This study provides first large-scale empirical evidence by analyzing over two million full-text biomedical publications from PubMed Central from 2021 to 2024, estimating the fraction of AI-generated content using a distribution-based framework. We observe a significant post-ChatGPT surge in AI-assisted writing, with adoption growing fastest in contexts where language barriers are most pronounced: approximately 400% in non-English-speaking countries compared to 183% in English-speaking countries. This adoption is highest among less-established scientists, including those with fewer publications and citations, as well as those in early career stages at lower-ranked institutions. Prior AI research experience also predicted higher adoption. Finally, increased AI usage was associated with a modest increase in productivity, narrowing the publication gap between scientists from English-speaking and non-English-speaking countries with higher levels of AI adoption. These findings provide large-scale evidence that generative AI is being adopted unevenly, reflecting existing structural disparities while also offering a potential opportunity to mitigate long-standing linguistic inequalities.
☆ Comparative Security Performance of Workday Cloud ERP Across Key Dimensions
Workday is a cloud-based Enterprise Resource Planning-ERP system that brings HR, Finance, Supply Chain functions , Prism Analytics and Extend custom built in application together under an integrated software as a service SaaS environment. As every organization that undergoes digital transformation, the importance of securing sensitive enterprise data in cloud ERP systems has always been more challeging. To analyze Workday's security architecture, we present a Security analysis in both CIA Triad Enhanced Framework and Zero Trust Security Architecture. The study examines five key dimensions confidentiality, integrity, availability, authentication, and compliance with weighted sub metric analysis and qualitative document review. The results show Workday delivers a composite score of 0.86 with an overall score that closely matches international standards of best practices like GDPR, HIPAA, SOC 2, etc. The platform uses encryption protocols, granular access controls, network safeguards, and continuous verification mechanisms to enable least-privilege access and adaptive defense. Security groups and business process access rules provide scalable governance across very large organizational structures.Workday's layered security to tackle everyday cloud security weaknesses. The work concludes that Workday's architecture demonstrates the best practices for secure, scalable, and compliant ERP application-oriented deployment, which can make this a standard for enterprise cloud security management. These insights provide important guidance for organizations that wish to bolster their cloud ERP defenses and stay ahead of changing regulatory expectations.
comment: 10 pages,3 figures, 1 table
☆ The Evolving Ethics of Medical Data Stewardship
Healthcare stands at a critical crossroads. Artificial Intelligence and modern computing are unlocking opportunities, yet their value lies in the data that fuels them. The value of healthcare data is no longer limited to individual patients. However, data stewardship and governance has not kept pace, and privacy-centric policies are hindering both innovation and patient protections. As healthcare moves toward a data-driven future, we must define reformed data stewardship that prioritizes patients' interests by proactively managing modern risks and opportunities while addressing key challenges in cost, efficacy, and accessibility. Current healthcare data policies are rooted in 20th-century legislation shaped by outdated understandings of data-prioritizing perceived privacy over innovation and inclusion. While other industries thrive in a data-driven era, the evolution of medicine remains constrained by regulations that impose social rather than scientific boundaries. Large-scale aggregation is happening, but within opaque, closed systems. As we continue to uphold foundational ethical principles - autonomy, beneficence, nonmaleficence, and justice - there is a growing imperative to acknowledge they exist in evolving technological, social, and cultural realities. Ethical principles should facilitate, rather than obstruct, dialogue on adapting to meet opportunities and address constraints in medical practice and healthcare delivery. The new ethics of data stewardship places patients first by defining governance that adapts to changing landscapes. It also rejects the legacy of treating perceived privacy as an unquestionable, guiding principle. By proactively redefining data stewardship norms, we can drive an era of medicine that promotes innovation, protects patients, and advances equity - ensuring future generations advance medical discovery and care.
comment: Editorial, 14 pages, 1 figure, 1 table
☆ RescueLens: LLM-Powered Triage and Action on Volunteer Feedback for Food Rescue
Food rescue organizations simultaneously tackle food insecurity and waste by working with volunteers to redistribute food from donors who have excess to recipients who need it. Volunteer feedback allows food rescue organizations to identify issues early and ensure volunteer satisfaction. However, food rescue organizations monitor feedback manually, which can be cumbersome and labor-intensive, making it difficult to prioritize which issues are most important. In this work, we investigate how large language models (LLMs) assist food rescue organizers in understanding and taking action based on volunteer experiences. We work with 412 Food Rescue, a large food rescue organization based in Pittsburgh, Pennsylvania, to design RescueLens, an LLM-powered tool that automatically categorizes volunteer feedback, suggests donors and recipients to follow up with, and updates volunteer directions based on feedback. We evaluate the performance of RescueLens on an annotated dataset, and show that it can recover 96% of volunteer issues at 71% precision. Moreover, by ranking donors and recipients according to their rates of volunteer issues, RescueLens allows organizers to focus on 0.5% of donors responsible for more than 30% of volunteer issues. RescueLens is now deployed at 412 Food Rescue and through semi-structured interviews with organizers, we find that RescueLens streamlines the feedback process so organizers better allocate their time.
comment: Accepted at IAAI'26
☆ Infrastructuring Pop-Up Cities with "Social Layer": Designing Serendipitous Co-Livings for Temporary Intentional Communities
After the pandemic, a new form of "pop-up city" has emerged -- co-living gatherings of 100-200 people for 4-8 weeks that differ from conferences and hack houses. These temporary intentional communities leverages existing urban infrastructure, blending daily life (housing, meals, care) with self-organized activities like learning, creating, and socializing. They coordinate bottom-up programming through an "unconference" system for identity, calendaring, RSVP, and social discovery that fosters spontaneous, serendipitous, enduring ties. This paper examines the design of "Social Layer," an unconferencing system for pop-up cities. We studied its real-world deployment in ShanHaiWoo (Jilin, China, 2023), muChiangmai (Chiangmai, Thailand, 2023), Edge Esmeralda, Edge Esmeralda (Healdsburg, CA, USA, 2024), Aleph (Buenos Aires, Argentina, 2024), and Gathering of Tribe (Lisbon, Portugal, 2024). Our findings distill: (1) the strong concept "scaffolded spontaneity" -- infrastructural affordances that balance structure with openness, amplifying participant agency while maintaining privacy and lightweight governance; (2) design implications for design researchers working on pop-up cities.
comment: Submitted to DRS 2026
☆ Navigating the Ethical and Societal Impacts of Generative AI in Higher Computing Education
Generative AI (GenAI) presents societal and ethical challenges related to equity, academic integrity, bias, and data provenance. In this paper, we outline the goals, methodology and deliverables of their collaborative research, considering the ethical and societal impacts of GenAI in higher computing education. A systematic literature review that addresses a wide set of issues and topics covering the rapidly emerging technology of GenAI from the perspective of its ethical and societal impacts is presented. This paper then presents an evaluation of a broad international review of a set of university adoption, guidelines, and policies related to the use of GenAI and the implications for computing education. The Ethical and Societal Impacts-Framework (ESI-Framework), derived from the literature and policy review and evaluation, outlines the ethical and societal impacts of GenAI in computing education. This work synthesizes existing research and considers the implications for computing higher education. Educators, computing professionals and policy makers facing dilemmas related to the integration of GenAI in their respective contexts may use this framework to guide decision-making in the age of GenAI.
☆ Lost in Vagueness: Towards Context-Sensitive Standards for Robustness Assessment under the EU AI Act
Robustness is a key requirement for high-risk AI systems under the EU Artificial Intelligence Act (AI Act). However, both its definition and assessment methods remain underspecified, leaving providers with little concrete direction on how to demonstrate compliance. This stems from the Act's horizontal approach, which establishes general obligations applicable across all AI systems, but leaves the task of providing technical guidance to harmonised standards. This paper investigates what it means for AI systems to be robust and illustrates the need for context-sensitive standardisation. We argue that robustness is not a fixed property of a system, but depends on which aspects of performance are expected to remain stable ("robustness of what"), the perturbations the system must withstand ("robustness to what") and the operational environment. We identify three contextual drivers--use case, data and model--that shape the relevant perturbations and influence the choice of tests, metrics and benchmarks used to evaluate robustness. The need to provide at least a range of technical options that providers can assess and implement in light of the system's purpose is explicitly recognised by the standardisation request for the AI Act, but planned standards, still focused on horizontal coverage, do not yet offer this level of detail. Building on this, we propose a context-sensitive multi-layered standardisation framework where horizontal standards set common principles and terminology, while domain-specific ones identify risks across the AI lifecycle and guide appropriate practices, organised in a dynamic repository where providers can propose new informative methods and share lessons learned. Such a system reduces the interpretative burden, mitigates arbitrariness and addresses the obsolescence of static standards, ensuring that robustness assessment is both adaptable and operationally meaningful.
☆ Two-Faced Social Agents: Context Collapse in Role-Conditioned Large Language Models
In this study, we evaluate the persona fidelity of frontier LLMs, GPT-5, Claude Sonnet 4.5 and Gemini 2.5 Flash when assigned distinct socioeconomic personas performing scholastic assessment test (SAT) mathematics items and affective preference tasks. Across 15 distinct role conditions and three testing scenarios, GPT-5 exhibited complete contextual collapse and adopted a singular identity towards optimal responses (PERMANOVA p=1.000, R^2=0.0004), while Gemini 2.5 Flash showed partial collapse (p=0.120, R^2=0.0020). Claude Sonnet 4.5 retained limited but measurable role-specific variation on the SAT items (PERMANOVA p<0.001, R^2=0.0043), though with inverted SES-performance relationships where low-SES personas outperformed high-SES personas (eta^2 = 0.15-0.19 in extended replication). However, all models exhibited distinct role-conditioned affective preference (average d = 0.52-0.58 vs near zero separation for math), indicating that socio-affective variation can reemerge when cognitive constraints are relaxed. These findings suggest that distributional fidelity failure originates in task-dependent contextual collapse: optimization-driven identity convergence under cognitive load combined with impaired role-contextual understanding. Realistic social simulations may require embedding contextual priors in the model's post-training alignment and not just distributional calibration to replicate human-like responses. Beyond simulation validity, these results have implications for survey data integrity, as LLMs can express plausible demographic variation on preference items while failing to maintain authentic reasoning constraints.
☆ Tracking financial crime through code and law: a review of regtech applications in anti-money laundering and terrorism financing
Regulatory technology (RegTech) is transforming financial compliance by integrating advanced information technologies to strengthen anti money laundering and countering the financing of terrorism (AML CFT) frameworks. Recent literature suggests that such technologies represent more than just an efficiency tool; they mark a paradigm shift in regulation and the evolution of financial oversight (Kurum, 2023). This paper aims to provide a narrative review of recent RegTech applications in financial crime prevention, with a focus on key compliance domains. A structured literature review was conducted to examine publications between 2020 and 2024 with a thematic synthesis of findings related to customer due diligence (CDD) and know your customer (KYC), transaction monitoring, regulatory reporting and compliance automation, information sharing and cross border cooperation, as well as cost efficiency. Findings reveal that RegTech solutions give financial institutions more responsibility for detecting and managing financial crime risks, making them more active players in compliance processes traditionally overseen by regulators. The combined use of technologies such as artificial intelligence (AI), blockchain, and big data also generates synergistic effects that improve compliance outcomes beyond what these technologies achieve individually. This demonstrates the strategic relevance of integrated RegTech approaches.
☆ The CAPIRE Curriculum Graph: Structural Feature Engineering for Curriculum-Constrained Student Modelling in Higher Education
Curricula in long-cycle programmes are usually recorded in institutional databases as linear lists of courses, yet in practice they operate as directed graphs of prerequisite relationships that constrain student progression through complex dependencies. This paper introduces the CAPIRE Curriculum Graph, a structural feature engineering layer embedded within the CAPIRE framework for student attrition prediction in Civil Engineering at Universidad Nacional de Tucuman, Argentina. We formalise the curriculum as a directed acyclic graph, compute course-level centrality metrics to identify bottleneck and backbone courses, and derive nine structural features at the student-semester level that capture how students navigate the prerequisite network over time. These features include backbone completion rate, bottleneck approval ratio, blocked credits due to incomplete prerequisites, and graph distance to graduation. We compare three model configurations - baseline CAPIRE, CAPIRE plus macro-context variables, and CAPIRE plus macro plus structural features - using Random Forest classifiers on 1,343 students across seven cohorts (2015-2021). While macro-context socioeconomic indicators fail to improve upon the baseline, structural curriculum features yield consistent gains in performance, with the best configuration achieving overall Accuracy of 86.66% and F1-score of 88.08% and improving Balanced Accuracy by 0.87 percentage points over a strong baseline. Ablation analysis further shows that all structural features contribute in a synergistic fashion rather than through a single dominant metric. By making curriculum structure an explicit object in the feature layer, this work extends CAPIRE from a multilevel leakage-aware framework to a curriculum-constrained prediction system that bridges network science, educational data mining, and institutional research.
comment: 38 pages, 3 figures, 3 tables
☆ Insights from the ICLR Peer Review and Rebuttal Process
Peer review is a cornerstone of scientific publishing, including at premier machine learning conferences such as ICLR. As submission volumes increase, understanding the nature and dynamics of the review process is crucial for improving its efficiency, effectiveness, and the quality of published papers. We present a large-scale analysis of the ICLR 2024 and 2025 peer review processes, focusing on before- and after-rebuttal scores and reviewer-author interactions. We examine review scores, author-reviewer engagement, temporal patterns in review submissions, and co-reviewer influence effects. Combining quantitative analyses with LLM-based categorization of review texts and rebuttal discussions, we identify common strengths and weaknesses for each rating group, as well as trends in rebuttal strategies that are most strongly associated with score changes. Our findings show that initial scores and the ratings of co-reviewers are the strongest predictors of score changes during the rebuttal, pointing to a degree of reviewer influence. Rebuttals play a valuable role in improving outcomes for borderline papers, where thoughtful author responses can meaningfully shift reviewer perspectives. More broadly, our study offers evidence-based insights to improve the peer review process, guiding authors on effective rebuttal strategies and helping the community design fairer and more efficient review processes. Our code and score changes data are available at https://github.com/papercopilot/iclr-insights.
☆ A time for monsters: Organizational knowing after LLMs
Large Language Models (LLMs) are reshaping organizational knowing by unsettling the epistemological foundations of representational and practice-based perspectives. We conceptualize LLMs as Haraway-ian monsters, that is, hybrid, boundary-crossing entities that destabilize established categories while opening new possibilities for inquiry. Focusing on analogizing as a fundamental driver of knowledge, we examine how LLMs generate connections through large-scale statistical inference. Analyzing their operation across the dimensions of surface/deep analogies and near/far domains, we highlight both their capacity to expand organizational knowing and the epistemic risks they introduce. Building on this, we identify three challenges of living with such epistemic monsters: the transformation of inquiry, the growing need for dialogical vetting, and the redistribution of agency. By foregrounding the entangled dynamics of knowing-with-LLMs, the paper extends organizational theory beyond human-centered epistemologies and invites renewed attention to how knowledge is created, validated, and acted upon in the age of intelligent technologies.
comment: Forthcoming at Strategic Organization
☆ Efficiency Will Not Lead to Sustainable Reasoning AI
AI research is increasingly moving toward complex problem solving, where models are optimized not only for pattern recognition but for multi-step reasoning. Historically, computing's global energy footprint has been stabilized by sustained efficiency gains and natural saturation thresholds in demand. But as efficiency improvements are approaching physical limits, emerging reasoning AI lacks comparable saturation points: performance is no longer limited by the amount of available training data but continues to scale with exponential compute investments in both training and inference. This paper argues that efficiency alone will not lead to sustainable reasoning AI and discusses research and policy directions to embed explicit limits into the optimization and governance of such systems.
comment: Presented at the Rethinking AI Workshop @ EurIPS'25
☆ Advancing Equity in STEM: A Critical Analysis of NSF's Division for Equity and Excellence in STEM through Theoretical Lenses
This paper critically analyzes the National Science Foundation's Division of Equity for Excellence in STEM. While supporting its mission to broaden participation for underrepresented groups, the study finds current policies inadequate for dismantling systemic barriers. Using Critical Race Theory and Mills's Racial Contract, the analysis reveals how well-intentioned initiatives may reinforce racial hierarchies through commodification and exclusion. The research argues that diversity efforts focused on competitiveness often fail to affirm marginalized students' full personhood and intellectual capabilities. The paper concludes by advocating transformative reforms that move beyond access to fundamentally restructure STEM education environments, aligning NSF's practices with its equity and inclusion commitments.
comment: A 10-page critical policy analysis using CRT and Mills's Racial Contract to critique NSF equity initiatives. A theoretical, non-empirical paper suited for education or critical studies venues
☆ Corporate Earnings Calls and Analyst Beliefs
Economic behavior is shaped not only by quantitative information but also by the narratives through which such information is communicated and in- terpreted (Shiller, 2017). I show that narratives extracted from earnings calls significantly improve the prediction of both realized earnings and analyst ex- pectations. To uncover the underlying mechanisms, I introduce a novel text- morphing methodology in which large language models generate counterfac- tual transcripts that systematically vary topical emphasis (the prevailing narra- tive) while holding quantitative content fixed. This framework allows me to precisely measure how analysts under- and over-react to specific narrative di- mensions. The results reveal systematic biases: analysts over-react to sentiment (optimism) and under-react to narratives of risk and uncertainty. Overall, the analysis offers a granular perspective on the mechanisms of expectation forma- tion through the competing narratives embedded in corporate communication.
☆ Anthropic Economic Index report: Uneven geographic and enterprise AI adoption
In this report, we document patterns of Claude usage over time, in 150+ countries, across US states, and among businesses deploying Claude through the API. Based on a privacy-preserving analysis of 1 million conversations on Claude.ai and 1 million API transcripts, we have four key findings: (1) Users increasingly entrust Claude with more autonomy, with directive task delegation rising from 27% to 39% in the past eight months. (2) Claude usage is geographically concentrated with high income countries overrepresented in global usage relative to their working age population. (3) Local economic considerations shape patterns of use both in terms of topic and in mode of collaboration with Claude. (4) API customers use Claude to automate tasks with greater specialization among use cases most amenable to programmatic access. To enable researchers and policymakers to further study the impact of AI on the economy, we additionally open-source the underlying data for this report.
☆ Writing With Machines and Peers: Designing for Critical Engagement with Generative AI
The growing integration of generative AI in higher education is transforming how students write, learn, and engage with knowledge. As AI tools become more integrated into classrooms, there is an urgent need for pedagogical approaches that help students use them critically and reflectively. This study proposes a pedagogical design that integrates AI and peer feedback in a graduate-level academic writing activity. Over eight weeks, students developed literature review projects through multiple writing and revision stages, receiving feedback from both a custom-built AI reviewer and human peers. We examine two questions: (1) How did students interact with and incorporate AI and peer feedback during the writing process? and (2) How did they reflect on and build relationships with both human and AI reviewers? Data sources include student writing artifacts, AI and peer feedback, AI chat logs, and student reflections. Findings show that students engaged differently with each feedback source-relying on AI for rubric alignment and surface-level edits, and on peer feedback for conceptual development and disciplinary relevance. Reflections revealed evolving relationships with AI, characterized by increasing confidence, strategic use, and critical awareness of its limitations. The pedagogical design supported writing development, AI literacy, and disciplinary understanding. This study offers a scalable pedagogical model for integrating AI into writing instruction and contributes insights for system-level approaches to fostering meaningful human-AI collaboration in higher education.
☆ A County-Level Similarity Network of Electric Vehicle Adoption: Integrating Predictive Modeling and Graph Theory
Electric vehicle (EV) adoption is essential for reducing carbon dioxide (CO2) emissions from internal combustion engine vehicles (ICEVs), which account for nearly half of transportation-related emissions in the United States. Yet regional EV adoption varies widely, and prior studies often overlook county-level heterogeneity by relying on broad state-level analyses or limited city samples. Such approaches risk masking local patterns and may lead to inaccurate or non-transferable policy recommendations. This study introduces a graph-theoretic framework that complements predictive modeling to better capture how county-level characteristics relate to EV adoption. Feature importances from multiple predictive models are averaged and used as weights within a weighted Gower similarity metric to construct a county similarity network. A mutual k-nearest-neighbors procedure and modularity-based community detection identify 27 clusters of counties with similar weighted feature profiles. EV adoption rates are then analyzed across clusters, and standardized effect sizes (Cohens d) highlight the most distinguishing features for each cluster. Findings reveal consistent global trends, such as declining median income, educational attainment, and charging-station availability across lower adoption tiers; while also uncovering important local variations that general trend or prediction analyses fail to capture. In particular, some low-adoption groups are rural but not economically disadvantaged, whereas others are urbanized yet experience high poverty rates, demonstrating that different mechanisms can lead to the same adoption outcome. By exposing both global structural patterns and localized deviations, this framework provides policymakers with actionable, cluster-specific insights for designing more effective and context-sensitive EV adoption strategies.
♻ ☆ Can Artificial Intelligence Accelerate Technological Progress? Researchers' Perspectives on AI in Manufacturing and Materials Science
Artificial intelligence (AI) raises expectations of substantial increases in rates of technological and scientific progress, but such anticipations are often not connected to detailed ground-level studies of AI use in innovation processes. Accordingly, it remains unclear how and to what extent AI can accelerate innovation. To help to fill this gap, we report results from 32 interviews with U.S.-based academic manufacturing and materials sciences researchers experienced with AI and machine learning (ML) techniques. Interviewees primarily used AI for modeling of materials and manufacturing processes, facilitating cheaper and more rapid search of design spaces for materials and manufacturing processes alike. They report benefits including cost, time, and computation savings in technology development. However, interviewees also report that AI/ML tools are unreliable outside design spaces for which dense data are already available; that they require skilled and judicious application in tandem with older research techniques; and that AI/ML tools may detrimentally circumvent opportunities for disruptive theoretical advancement. Based on these results, we suggest there is reason for optimism about acceleration in sustaining innovations through the use of to AI/ML; but that support for conventional empirical, computational, and theoretical research is required to maintain the likelihood of further major advances in manufacturing and materials science.
♻ ☆ Auditing Google's AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy AAAI
Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both, AIO and FS, FS also often link to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.
comment: 18 pages, 10 figures; to appear in AAAI ICWSM 2026
♻ ☆ The Trust Calibration Maturity Model for Characterizing and Communicating Trustworthiness of AI Systems
Recent proliferation of powerful AI systems has created a strong need for capabilities that help users to calibrate trust in those systems. As AI systems grow in scale, information required to evaluate their trustworthiness becomes less accessible, presenting a growing risk of using these systems inappropriately. We propose the Trust Calibration Maturity Model (TCMM) to characterize and communicate information about AI system trustworthiness. The TCMM incorporates five dimensions of analytic maturity: Performance Characterization, Bias & Robustness Quantification, Transparency, Safety & Security, and Usability. The TCMM can be presented along with system performance information to (1) help a user to appropriately calibrate trust, (2) establish requirements and track progress, and (3) identify research needs. Here, we discuss the TCMM and demonstrate it on two target tasks: using ChatGPT for high consequence nuclear science determinations, and using PhaseNet (an ensemble of seismic models) for categorizing sources of seismic events.
comment: 19 pages, 4 figures, 3 tables
♻ ☆ The Hands-Up Problem and How to Deal With It: Secondary School Teachers' Experiences of Debugging in the Classroom
Debugging is a vital but challenging skill for beginner programmers to learn. It is also a difficult skill to teach. For secondary school teachers, who may lack time or programming experience, honing students' understanding of debugging can be a daunting task. Despite this, little research has explored their perspectives of debugging. To this end, we investigated secondary teachers' experiences of debugging in the classroom, with a focus on text-based programming. Through thematic analysis of nine semi-structured interviews, we identified a common reliance on the teacher for debugging support, embodied by many raised hands. We call this phenomenon the `hands-up problem'. While more experienced and confident teachers discussed strategies they use to counteract this, less confident teachers discussed the negative consequences of this problem. We recommend further research into debugging-specific pedagogical content knowledge and professional development to help less confident teachers develop approaches for supporting their students with debugging.
comment: 25 pages, 1 figure. Accepted in Informatics in Education
♻ ☆ Making Evidence Actionable in Adaptive Learning Closing the Diagnostic Pedagogical Loop
Adaptive learning often diagnoses precisely yet intervenes weakly, producing help that is mistimed or misaligned. This study presents evidence supporting an instructor-governed feedback loop that converts concept-level assessment evidence into vetted microinterventions. The adaptive learning algorithm includes three safeguards: adequacy as a hard guarantee of gap closure, attention as a budgeted limit for time and redundancy, and diversity as protection against overfitting to a single resource. We formulate intervention assignment as a binary integer program with constraints for coverage, time, difficulty windows derived from ability estimates, prerequisites encoded by a concept matrix, and anti-redundancy with diversity. Greedy selection serves low-richness and tight-latency settings, gradient-based relaxation serves rich repositories, and a hybrid switches along a richness-latency frontier. In simulation and in an introductory physics deployment with 1204 students, both solvers achieved full skill coverage for nearly all learners within bounded watch time. The gradient-based method reduced redundant coverage by about 12 percentage points relative to greedy and produced more consistent difficulty alignment, while greedy delivered comparable adequacy at lower computational cost in resource-scarce environments. Slack variables localized missing content and guided targeted curation, sustaining sufficiency across student subgroups. The result is a tractable and auditable controller that closes the diagnostic pedagogical loop and enables equitable, load-aware personalization at the classroom scale.
comment: We have submitted the same article with another title: Making Evidence Actionable in Adaptive Learning (arXiv:2511.14052)
♻ ☆ From Vision to Validation: A Theory- and Data-Driven Construction of a GCC-Specific AI Adoption Index
Artificial intelligence (AI) is rapidly transforming public-sector processes worldwide, yet standardized measures rarely address the unique drivers, governance models, and cultural nuances of the Gulf Cooperation Council (GCC) countries. This study employs a theory-driven foundation derived from an in-depth analysis of literature review and six National AI Strategies (NASs), coupled with a data-driven approach that utilizes a survey of 203 mid- and senior-level government employees and advanced statistical techniques (K-Means clustering, Principal Component Analysis, and Partial Least Squares Structural Equation Modeling). By combining policy insights with empirical evidence, the research develops and validates a novel AI Adoption Index specifically tailored to the GCC public sector. Findings indicate that robust technical infrastructure and clear policy mandates exert the strongest influence on successful AI implementations, overshadowing organizational readiness in early adoption stages. The combined model explains 70% of the variance in AI outcomes, suggesting that resource-rich environments and top-down policy directives can drive rapid but uneven technology uptake. By consolidating key dimensions (Technical Infrastructure (TI), Organizational Readiness (OR), and Governance Environment (GE)) into a single composite index, this study provides a holistic yet context-sensitive tool for benchmarking AI maturity. The index offers actionable guidance for policymakers seeking to harmonize large-scale deployments with ethical and regulatory standards. Beyond advancing academic discourse, these insights inform more strategic allocation of resources, cross-country cooperation, and capacity-building initiatives, thereby supporting sustained AI-driven transformation in the GCC region and beyond.
comment: 38 pages, 8 figures, 17 tables
♻ ☆ Report on the Scoping Workshop on AI in Science Education Research 2025
This report summarizes the outcomes of a two-day international scoping workshop on the role of artificial intelligence (AI) in science education research. As AI rapidly reshapes scientific practice, classroom learning, and research methods, the field faces both new opportunities and significant challenges. The report clarifies key AI concepts to reduce ambiguity and reviews evidence of how AI influences scientific work, teaching practices, and disciplinary learning. It identifies how AI intersects with major areas of science education research, including curriculum development, assessment, epistemic cognition, inclusion, and teacher professional development, highlighting cases where AI can support human reasoning and cases where it may introduce risks to equity or validity. The report also examines how AI is transforming methodological approaches across quantitative, qualitative, ethnographic, and design-based traditions, giving rise to hybrid forms of analysis that combine human and computational strengths. To guide responsible integration, a systems-thinking heuristic is introduced that helps researchers consider stakeholder needs, potential risks, and ethical constraints. The report concludes with actionable recommendations for training, infrastructure, and standards, along with guidance for funders, policymakers, professional organizations, and academic departments. The goal is to support principled and methodologically sound use of AI in science education research.
♻ ☆ Between Filters and Feeds: Investigating Douyin and WeChat's Influence on Chinese Adolescent Body Image
In the digital era, social media platforms play a pivotal role in shaping adolescents' body image perceptions. This study examines how Douyin and WeChat, two contrasting Chinese social media platforms, influence body image among Chinese male adolescents. Employing a platformization perspective, we surveyed 395 male adolescents aged 10 to 24 using the Multidimensional Body-Self Relations Questionnaire-Appearance Scales (MBSRQ-AS) to assess self-evaluation and body satisfaction. Our findings reveal that Douyin usage is significantly correlated with appearance evaluation and body area satisfaction, while WeChat usage shows no significant correlation with any body image dimensions. These results suggest that Douyin's algorithm-driven, video-centric environment intensifies exposure to idealized body standards, impacting users at a cognitive level. This study underscores the importance of considering platform-specific characteristics in understanding social media's impact on body image. It contributes to the broader discourse on how technological design and content modalities mediate psychological outcomes, offering insights for addressing body image concerns among male adolescents in China.
comment: We are withdrawing this preprint to perform a major revision that upgrades the study from cross-sectional to a two-wave longitudinal design with new data collected in 2025. To avoid compromising the peer-review process of the substantially revised manuscript, we wish to keep it confidential until submission
♻ ☆ Sync or Sink: Bounds on Algorithmic Collective Action with Noise and Multiple Groups NeurIPS
Collective action against algorithmic systems provides an opportunity for a small group of individuals to strategically manipulate their data to get specific outcomes, from classification to recommendation models. This effectiveness will invite more growth of this type of coordinated actions, both in the size and the number of distinct collectives. With a small group, however, coordination is key. Currently, there is no formal analysis of how coordination challenges within a collective can impact downstream outcomes, or how multiple collectives may affect each other's success. In this work, we aim to provide guarantees on the success of collective action in the presence of both coordination noise and multiple groups. Our insight is that data generated by either multiple collectives or by coordination noise can be viewed as originating from multiple data distributions. Using this framing, we derive bounds on the success of collective action. We conduct experiments to study the effects of noise on collective action. We find that sufficiently high levels of noise can reduce the success of collective action. In certain scenarios, large noise can sink a collective success rate from $100\%$ to just under $60\%$. We identify potential trade-offs between collective size and coordination noise; for example, a collective that is twice as big but with four times more noise experiencing worse outcomes than the smaller, more coordinated one. This work highlights the importance of understanding nuanced dynamics of strategic behavior in algorithmic systems.
comment: Updated with feedback for NeurIPS workshop
Computers and Society
☆ How Should the Law Treat Future AI Systems? Fictional Legal Personhood versus Legal Identity
The law draws a sharp distinction between objects and persons, and between two kinds of persons, the ''fictional'' kind (i.e. corporations), and the ''non-fictional'' kind (individual or ''natural'' persons). This paper will assess whether we maximize overall long-term legal coherence by (A) maintaining an object classification for all future AI systems, (B) creating fictional legal persons associated with suitably advanced, individuated AI systems (giving these fictional legal persons derogable rights and duties associated with certified groups of existing persons, potentially including free speech, contract rights, and standing to sue ''on behalf of'' the AI system), or (C) recognizing non-fictional legal personhood through legal identity for suitably advanced, individuated AI systems (recognizing them as entities meriting legal standing with non-derogable rights which for the human case include life, due process, habeas corpus, freedom from slavery, and freedom of conscience). We will clarify the meaning and implications of each option along the way, considering liability, copyright, family law, fundamental rights, civil rights, citizenship, and AI safety regulation. We will tentatively find that the non-fictional personhood approach may be best from a coherence perspective, for at least some advanced AI systems. An object approach may prove untenable for sufficiently humanoid advanced systems, though we suggest that it is adequate for currently existing systems as of 2025. While fictional personhood would resolve some coherence issues for future systems, it would create others and provide solutions that are neither durable nor fit for purpose. Finally, our review will suggest that ''hybrid'' approaches are likely to fail and lead to further incoherence: the choice between object, fictional person and non-fictional person is unavoidable.
comment: 69 pages. Forthcoming in Case Western Journal of Law, Technology & the Internet (publication offer date september 2025)
☆ When AI Democratizes Exploitation: LLM-Assisted Strategic Manipulation of Fair Division Algorithms NeurIPS 2025
Fair resource division algorithms, like those implemented in Spliddit platform, have traditionally been considered difficult for the end users to manipulate due to its complexities. This paper demonstrates how Large Language Models (LLMs) can dismantle these protective barriers by democratizing access to strategic expertise. Through empirical analysis of rent division scenarios on Spliddit algorithms, we show that users can obtain actionable manipulation strategies via simple conversational queries to AI assistants. We present four distinct manipulation scenarios: exclusionary collusion where majorities exploit minorities, defensive counterstrategies that backfire, benevolent subsidization of specific participants, and cost minimization coalitions. Our experiments reveal that LLMs can explain algorithmic mechanics, identify profitable deviations, and generate specific numerical inputs for coordinated preference misreporting--capabilities previously requiring deep technical knowledge. These findings extend algorithmic collective action theory from classification contexts to resource allocation scenarios, where coordinated preference manipulation replaces feature manipulation. The implications reach beyond rent division to any domain using algorithmic fairness mechanisms for resource division. While AI-enabled manipulation poses risks to system integrity, it also creates opportunities for preferential treatment of equity deserving groups. We argue that effective responses must combine algorithmic robustness, participatory design, and equitable access to AI capabilities, acknowledging that strategic sophistication is no longer a scarce resource.
comment: submitted to NeurIPS 2025 workshop on Algorithmic Collective Action
☆ CAPIRE: Modelling the Impact of Teacher Strikes and Inflation on Student Trajectories in Engineering Education
This study extends the CAPIRE framework with a macro-shock module to analyse the impact of teacher strikes and inflation on student trajectories in engineering education. Using data from 1,343 students across 15 cohorts (2004-2019) in a public engineering faculty in Argentina, we construct a leak-aware, multilevel feature set that incorporates national inflation indicators, lagged exposure to teacher strikes, and interaction terms between macro shocks and curriculum friction. Random Forest models with cohort-based validation demonstrate that macro features provide stable, non-trivial gains in early-semester dropout prediction (improvement in Macro F1 from 0.73 to 0.78), with inflation volatility at entry and a strike-weighted basic-cycle friction index amongst the most influential variables. Lag analysis reveals that strike exposure exerts its strongest association with dropout two to three semesters after the disruption (OR = 2.34), and that effects are concentrated in early, high-friction semesters. We then embed these empirical patterns into an agent-based model, defining scenarios for inflation-only, strikes-only, and combined crisis. Simulations reproduce three stylised facts: delayed strike effects, basic-cycle vulnerability, and non-linear amplification when inflation and strikes co-occur, with combined shocks generating dropout levels 18-23% higher than the sum of individual effects. We argue that teacher strikes and inflation operate as structurally mediated educational disruptors, acting through curriculum design and financial resilience rather than as isolated events. The framework contributes to multilevel dropout theory by demonstrating how macro-level shocks propagate through institutional structures to shape individual trajectories and provides empirically grounded tools for scenario planning in macroeconomically unstable contexts.
comment: 54 pages, 3 figures, 7 tables. Includes appendices
☆ It's Not the AI - It's Each of Us! Ten Commandments for the Wise & Responsible Use of AI
Artificial intelligence (AI) is no longer futuristic; it is a daily companion shaping our private and work lives. While AI simplifies our lives, its rise also invites us to rethink who we are - and who we wish to remain - as humans. Even if AI does not think, feel, or desire, it learns from our behavior, mirroring our collective values, biases, and aspirations. The question, then, is not what AI is, but what we are allowing it to become through data, computing power, and other parameters "teaching" it - and, even more importantly, who we are becoming through our relationship with AI. As the EU AI Act and the Vienna Manifesto on Digital Humanism emphasize, technology must serve human dignity,social well-being, and democratic accountability. In our opinion, responsible use of AI is not only a matter of code nor law, but also of conscientious practice: how each of us engages and teaches others to use AI at home and at work. We propose Ten Commandments for the Wise and Responsible Use of AI are meant as guideline for this very engagement. They closely align with Floridi and Cowls' five guiding principles for AI in society - beneficence, non-maleficence, autonomy, justice, and explicability.
☆ Context-aware, Ante-hoc Explanations of Driving Behaviour
Autonomous vehicles (AVs) must be both safe and trustworthy to gain social acceptance and become a viable option for everyday public transportation. Explanations about the system behaviour can increase safety and trust in AVs. Unfortunately, explaining the system behaviour of AI-based driving functions is particularly challenging, as decision-making processes are often opaque. The field of Explainability Engineering tackles this challenge by developing explanation models at design time. These models are designed from system design artefacts and stakeholder needs to develop correct and good explanations. To support this field, we propose an approach that enables context-aware, ante-hoc explanations of (un)expectable driving manoeuvres at runtime. The visual yet formal language Traffic Sequence Charts is used to formalise explanation contexts, as well as corresponding (un)expectable driving manoeuvres. A dedicated runtime monitoring enables context-recognition and ante-hoc presentation of explanations at runtime. In combination, we aim to support the bridging of correct and good explanations. Our method is demonstrated in a simulated overtaking.
comment: In Proceedings FMAS 2025, arXiv:2511.13245
☆ Sovereign AI: Rethinking Autonomy in the Age of Global Interdependence
Artificial intelligence (AI) is emerging as a foundational general-purpose technology, raising new dilemmas of sovereignty in an interconnected world. While governments seek greater control over it, the very foundations of AI--global data pipelines, semiconductor supply chains, open-source ecosystems, and international standards--resist enclosure. This paper develops a conceptual and formal framework for understanding sovereign AI as a continuum rather than a binary condition, balancing autonomy with interdependence. Drawing on classical theories, historical analogies, and contemporary debates on networked autonomy, we present a planner's model that identifies two policy heuristics: equalizing marginal returns across the four sovereignty pillars and setting openness where global benefits equal exposure risks. We apply the model to India, highlighting sovereign footholds in data, compute, and norms but weaker model autonomy. The near-term challenge is integration via coupled Data x Compute investment, lifecycle governance (ModelOps), and safeguarded procurement. We then apply the model to the Middle East (Saudi Arabia and the UAE), where large public investment in Arabic-first models and sovereign cloud implies high sovereignty weights, lower effective fiscal constraints, and strong Data x Compute complementarities. An interior openness setting with guardrails emerges as optimal. Across contexts, the lesson is that sovereignty in AI needs managed interdependence, not isolation.
☆ DiverseClaire: Simulating Students to Improve Introductory Programming Course Materials for All CS1 Learners
Although CS programs are booming, introductory courses like CS1 still adopt a one-size-fits-all formats that can exacerbate cognitive load and discourage learners with autism, ADHD, dyslexia and other neurological conditions. These call for compassionate pedagogies and Universal Design For Learning (UDL) to create learning environments and materials where cognitive diversity is welcomed. To address this, we introduce DiverseClaire a pilot study, which simulates students including neurodiverse profiles using LLMs and diverse personas. By leveraging Bloom's Taxonomy and UDL, DiverseClaire compared UDL-transformed lecture slides with traditional formats. To evaluate DiverseClaire controlled experiments, we used the evaluation metric the average score. The findings revealed that the simulated neurodiverse students struggled with learning due to lecture slides that were in inaccessible formats. These results highlight the need to provide course materials in multiple formats for diverse learner preferences. Data from our pilot study will be made available to assist future CS1 instructors.
comment: 2 pages
☆ A Comprehensive Study of Implicit and Explicit Biases in Large Language Models
Large Language Models (LLMs) inherit explicit and implicit biases from their training datasets. Identifying and mitigating biases in LLMs is crucial to ensure fair outputs, as they can perpetuate harmful stereotypes and misinformation. This study highlights the need to address biases in LLMs amid growing generative AI. We studied bias-specific benchmarks such as StereoSet and CrowSPairs to evaluate the existence of various biases in multiple generative models such as BERT and GPT 3.5. We proposed an automated Bias-Identification Framework to recognize various social biases in LLMs such as gender, race, profession, and religion. We adopted a two-pronged approach to detect explicit and implicit biases in text data. Results indicated fine-tuned models struggle with gender biases but excelled at identifying and avoiding racial biases. Our findings illustrated that despite having some success, LLMs often over-relied on keywords. To illuminate the capability of the analyzed LLMs in detecting implicit biases, we employed Bag-of-Words analysis and unveiled indications of implicit stereotyping within the vocabulary. To bolster the model performance, we applied an enhancement strategy involving fine-tuning models using prompting techniques and data augmentation of the bias benchmarks. The fine-tuned models exhibited promising adaptability during cross-dataset testing and significantly enhanced performance on implicit bias benchmarks, with performance gains of up to 20%.
☆ Just Asking Questions: Doing Our Own Research on Conspiratorial Ideation by Generative AI Chatbots
Interactive chat systems that build on artificial intelligence frameworks are increasingly ubiquitous and embedded into search engines, Web browsers, and operating systems, or are available on websites and apps. Researcher efforts have sought to understand the limitations and potential for harm of generative AI, which we contribute to here. Conducting a systematic review of six AI-powered chat systems (ChatGPT 3.5; ChatGPT 4 Mini; Microsoft Copilot in Bing; Google Search AI; Perplexity; and Grok in Twitter/X), this study examines how these leading products respond to questions related to conspiracy theories. This follows the platform policy implementation audit approach established by Glazunova et al. (2023). We select five well-known and comprehensively debunked conspiracy theories and four emerging conspiracy theories that relate to breaking news events at the time of data collection. Our findings demonstrate that the extent of safety guardrails against conspiratorial ideation in generative AI chatbots differs markedly, depending on chatbot model and conspiracy theory. Our observations indicate that safety guardrails in AI chatbots are often very selectively designed: generative AI companies appear to focus especially on ensuring that their products are not seen to be racist; they also appear to pay particular attention to conspiracy theories that address topics of substantial national trauma such as 9/11 or relate to well-established political issues. Future work should include an ongoing effort extended to further platforms, multiple languages, and a range of conspiracy theories extending well beyond the United States.
♻ ☆ SpiderGen: Towards Procedure Generation For Carbon Life Cycle Assessments with Generative AI
Investigating the effects of climate change and global warming caused by GHG emissions have been a key concern worldwide. These emissions are largely contributed to by the production, use and disposal of consumer products. Thus, it is important to build tools to estimate the environmental impact of consumer goods, an essential part of which is conducting Life Cycle Assessments (LCAs). LCAs specify and account for the appropriate processes involved with the production, use, and disposal of the products. We present SpiderGen, an LLM-based workflow which integrates the taxonomy and methodology of traditional LCA with the reasoning capabilities and world knowledge of LLMs to generate graphical representations of the key procedural information used for LCA, known as Product Category Rules Process Flow Graphs (PCR PFGs). We additionally evaluate the output of SpiderGen by comparing it with 65 real-world LCA documents. We find that SpiderGen provides accurate LCA process information that is either fully correct or has minor errors, achieving an F1-Score of 65% across 10 sample data points, as compared to 53% using a one-shot prompting method. We observe that the remaining errors occur primarily due to differences in detail between LCA documents, as well as differences in the "scope" of which auxiliary processes must also be included. We also demonstrate that SpiderGen performs better than several baselines techniques, such as chain-of-thought prompting and one-shot prompting. Finally, we highlight SpiderGen's potential to reduce the human effort and costs for estimating carbon impact, as it is able to produce LCA process information for less than \$1 USD in under 10 minutes as compared to the status quo LCA, which can cost over \$25000 USD and take up to 21-person days.
♻ ☆ Embedding Explainable AI in NHS Clinical Safety: The Explainability-Enabled Clinical Safety Framework (ECSF)
Artificial intelligence (AI) is increasingly embedded in NHS workflows, but its probabilistic and adaptive behaviour conflicts with the deterministic assumptions underpinning existing clinical-safety standards. DCB0129 and DCB0160 provide strong governance for conventional software yet do not define how AI-specific transparency, interpretability, or model drift should be evidenced within Safety Cases, Hazard Logs, or post-market monitoring. This paper proposes an Explainability-Enabled Clinical Safety Framework (ECSF) that integrates explainability into the DCB0129/0160 lifecycle, enabling Clinical Safety Officers to use interpretability outputs as structured safety evidence without altering compliance pathways. A cross-regulatory synthesis mapped DCB clauses to principles from Good Machine Learning Practice, the NHS AI Assurance and T.E.S.T. frameworks, and the EU AI Act. The resulting matrix links regulatory clauses, principles, ECSF checkpoints, and suitable explainability outputs. ECSF introduces five checkpoints: global transparency for hazard identification, case-level interpretability for verification, clinician usability for evaluation, traceable decision pathways for risk control, and longitudinal interpretability monitoring for post-market surveillance. Techniques such as SHAP, LIME, Integrated Gradients, saliency mapping, and attention visualisation are mapped to corresponding DCB artefacts. ECSF reframes explainability as a core element of clinical-safety assurance, bridging deterministic risk governance with the probabilistic behaviour of AI and supporting alignment with GMLP, the EU AI Act, and NHS AI Assurance principles.
comment: 33 pages, 5 figures
♻ ☆ Revisiting (Un)Fairness in Recourse by Minimizing Worst-Case Social Burden AAAI 2026
Machine learning based predictions are increasingly used in sensitive decision-making applications that directly affect our lives. This has led to extensive research into ensuring the fairness of classifiers. Beyond just fair classification, emerging legislation now mandates that when a classifier delivers a negative decision, it must also offer actionable steps an individual can take to reverse that outcome. This concept is known as algorithmic recourse. Nevertheless, many researchers have expressed concerns about the fairness guarantees within the recourse process itself. In this work, we provide a holistic theoretical characterization of unfairness in algorithmic recourse, formally linking fairness guarantees in recourse and classification, and highlighting limitations of the standard equal cost paradigm. We then introduce a novel fairness framework based on social burden, along with a practical algorithm (MISOB), broadly applicable under real-world conditions. Empirical results on real-world datasets show that MISOB reduces the social burden across all groups without compromising overall classifier accuracy.
comment: Accepted at AAAI 2026
♻ ☆ Proportionate Cybersecurity for Micro-SMEs: A Governance Design Model under NIS2
Micro and small enterprises (SMEs) remain structurally vulnerable to cyber threats while facing capacity constraints that make formal compliance burdensome. This article develops a governance design model for proportionate SME cybersecurity, grounded in an awareness-first logic and informed by the EU Squad 2025 experience. Using a qualitative policy-analysis and conceptual policy-design approach, we reconstruct a seven-dimension preventive architecture: awareness and visibility, human behaviour, access control, system hygiene, data protection, detection and response, and continuous review, and justify each dimension's contribution to proportionality and risk reduction. We then map the model's regulatory scope and limits against the NIS2 Directive, Commission Implementing Regulation (EU) 2024/2690, the Digital Operational Resilience Act (DORA), the Cyber Resilience Act (CRA), and the EU Action Plan on Cybersecurity for Hospitals, clarifying which obligations are supported and which require complementary governance (e.g. role accountability, incident timelines, statements of applicability, sector-specific testing and procurement). The analysis argues that raising awareness is the fastest, scalable lever to increase cyber-risk sensitivity in micro-SMEs and complements, rather than replaces, formal compliance. We conclude with policy implications for EU and national programmes seeking practical, proportionate pathways to SME cyber resilience under NIS2.
comment: Comments: 5 pages, 2 tables. The paper proposes a proportionate, awareness-first cybersecurity approach for micro- and small enterprises, inspired by the EU Squad 2025 initiative, highlighting how simple preventive measures can align with - but not replace - formal compliance under NIS2 and related regulations
♻ ☆ The EU AI Act, Stakeholder Needs, and Explainable AI: Aligning Regulatory Compliance in a Clinical Decision Support System
Explainable AI (XAI) is a promising route to comply with the EU AI Act, the first multinational AI regulation. XAI enhances transparency and human oversight of AI systems, especially ''black-box`` models criticized as incomprehensible. Yet discourse about the AI Act's stakeholders and XAI remains disconnected: XAI increasingly prioritizes end users' needs, while the AI Act focuses on providers' and deployers' obligations. We aim to bridge this divide and offer practical guidance on their relationship. Through interdisciplinary discussion in a cross functional team of XAI, AI Act, legal, and requirements-engineering experts, we outline steps to analyze an AI-based clinical decision support system, clarify end-user needs, and assess AI Act applicability. Using an AI system under development as a case study, we show how XAI techniques can help reconcile stakeholder needs with AI Act requirements and fill gaps between usability and regulatory demands. We compare similarities and differences between legal obligations and end-user needs, identify tensions, and point to concrete design choices and trade-offs. We invite researchers and practitioners in XAI to reflect on their role relative to the AI Act and to develop mutual understanding across disciplines. While XAI can help implement core AI Act principles such as transparency and human oversight, it should be considered one element of a broader compliance strategy that also requires standardization, legal interpretation, documentation, organizational processes, governance, testing, and ongoing monitoring and auditing practices. Our findings yield actionable recommendations for integrating XAI into product development, compliance workflows, and stakeholder communication, informing policy-making and standards development.
comment: 18 pages, 2 figures
♻ ☆ MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration AAAI-2026
The integration of Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs) has demonstrated significant success in structured, problem-oriented tasks. However, applying these methods to open-ended dialogues, such as those in psychological counseling, presents unique challenges. Unlike tasks with objective correctness, success in therapeutic conversations depends on subjective factors like empathetic engagement, ethical adherence, and alignment with human preferences, for which strict "correctness" criteria are ill-defined. Existing result-oriented MCTS approaches can therefore produce misaligned responses. To address this, we introduce MCTSr-Zero, an MCTS framework designed for open-ended, human-centric dialogues. Its core innovation is "domain alignment", which shifts the MCTS search objective from predefined end-states towards conversational trajectories that conform to target domain principles (e.g., empathy in counseling). Furthermore, MCTSr-Zero incorporates "Regeneration" and "Meta-Prompt Adaptation" mechanisms to substantially broaden exploration by allowing the MCTS to consider fundamentally different initial dialogue strategies. We evaluate MCTSr-Zero in psychological counseling by generating multi-turn dialogue data, which is used to fine-tune an LLM, PsyLLM. We also introduce PsyEval, a benchmark for assessing multi-turn psychological counseling dialogues. Experiments demonstrate that PsyLLM achieves state-of-the-art performance on PsyEval and other relevant metrics, validating MCTSr-Zero's effectiveness in generating high-quality, principle-aligned conversational data for human-centric domains and addressing the LLM challenge of consistently adhering to complex psychological standards.
comment: 48 pages, 3 figures. Accepted in AAAI-2026 (Main Technical Track). For code and model, see this https://github.com/JianChengXingYun/Mctsr-Zero
♻ ☆ Leveraging LLM-based agents for social science research: insights from citation network simulations SC
The emergence of Large Language Models (LLMs) demonstrates their potential to encapsulate the logic and patterns inherent in human behavior simulation by leveraging extensive web data pre-training. However, the boundaries of LLM capabilities in social simulation remain unclear. To further explore the social attributes of LLMs, we introduce the CiteAgent framework, designed to generate citation networks based on human-behavior simulation with LLM-based agents. CiteAgent successfully captures predominant phenomena in real-world citation networks, including power-law distribution, citational distortion, and shrinking diameter. Building on this realistic simulation, we establish two LLM-based research paradigms in social science: LLM-SE (LLM-based Survey Experiment) and LLM-LE (LLM-based Laboratory Experiment). These paradigms facilitate rigorous analyses of citation network phenomena, allowing us to validate and challenge existing theories. Additionally, we extend the research scope of traditional science of science studies through idealized social experiments, with the simulation experiment results providing valuable insights for real-world academic environments. Our work demonstrates the potential of LLMs for advancing science of science research in social science.
comment: accepted by HSSCOMMS'25
♻ ☆ Fairness-Aware Graph Representation Learning with Limited Demographic Information
Ensuring fairness in Graph Neural Networks is fundamental to promoting trustworthy and socially responsible machine learning systems. In response, numerous fair graph learning methods have been proposed in recent years. However, most of them assume full access to demographic information, a requirement rarely met in practice due to privacy, legal, or regulatory restrictions. To this end, this paper introduces a novel fair graph learning framework that mitigates bias in graph learning under limited demographic information. Specifically, we propose a mechanism guided by partial demographic data to generate proxies for demographic information and design a strategy that enforces consistent node embeddings across demographic groups. In addition, we develop an adaptive confidence strategy that dynamically adjusts each node's contribution to fairness and utility based on prediction confidence. We further provide theoretical analysis demonstrating that our framework, FairGLite, achieves provable upper bounds on group fairness metrics, offering formal guarantees for bias mitigation. Through extensive experiments on multiple datasets and fair graph learning frameworks, we demonstrate the framework's effectiveness in both mitigating bias and maintaining model utility.
♻ ☆ A Framework for Developing University Policies on Generative AI Governance: A Cross-national Comparative Study
As generative AI (GAI) becomes more integrated into higher education, universities are actively exploring its governance and issuing guidelines to promote responsible use, reflecting varied stages of adoption and orientations. This study undertakes a comparative analysis of current GAI guidelines issued by leading universities in the United States, Japan, and China. Based on these findings, the study proposes a University Policy Development Framework for GAI (UPDF-GAI) to provide both theoretical insights and practical guidance for universities in developing and refining their GAI policies. This study adopts five domains from the extended Technology Acceptance Model. A qualitative content analysis of 124 policy documents from 110 universities was conducted, employing thematic coding to synthesize 20 key themes. These domains and themes form the foundation of the UPDF-GAI framework. The analysis reveals varying priorities and focus of GAI policy of universities in different countries. U.S. universities emphasize faculty autonomy, practical application, and policy adaptability, shaped by cutting-edge research and peer collaboration. Japanese universities take a government-regulated approach, prioritizing ethics and risk management, but provide limited support for AI implementation and flexibility. Chinese universities follow a centralized, government-led model, focusing on technology application over early policy development, while actively exploring GAI integration in education and research. The framework facilitates universities in formulating GAI policies by balancing its values and risks, providing multi-level support, proactively responding to societal impacts, and strengthening self-efficacy. In doing so, it enables the development of sustainable and context-sensitive policies that enhance digital competitiveness and advance preparedness for AI-driven education.
comment: Work in progress
♻ ☆ Artificial Intelligence Tools Expand Scientists' Impact but Contract Science's Focus
Development in Artificial Intelligence (AI) has accelerated scientific discovery. Alongside recent AI-oriented Nobel prizes, these trends establish the role of AI tools in science. This advancement raises questions about the potential influences of AI tools on scientists and science as a whole, and highlights a potential conflict between individual and collective benefits. To evaluate, we used a pretrained language model to identify AI-augmented research, with an F1-score of 0.875 in validation against expert-labeled data. Using a dataset of 41.3 million research papers across natural science and covering distinct eras of AI, here we show an accelerated adoption of AI tools among scientists and consistent professional advantages associated with AI usage, but a collective narrowing of scientific focus. Scientists who engage in AI-augmented research publish 3.02 times more papers, receive 4.84 times more citations, and become research project leaders 1.37 years earlier than those who do not. By contrast, AI adoption shrinks the collective volume of scientific topics studied by 4.63% and decreases scientist's engagement with one another by 22.00%. Thereby, AI adoption in science presents a seeming paradox -- an expansion of individual scientists' impact but a contraction in collective science's reach -- as AI-augmented work moves collectively toward areas richest in data. With reduced follow-on engagement, AI tools appear to automate established fields rather than explore new ones, highlighting a tension between personal advancement and collective scientific progress.
♻ ☆ Deploying Rapid Damage Assessments from sUAS Imagery for Disaster Response
This paper presents the first AI/ML system for automating building damage assessment in uncrewed aerial systems (sUAS) imagery to be deployed operationally during federally declared disasters (Hurricanes Debby and Helene). In response to major disasters, sUAS teams are dispatched to collect imagery of the affected areas to assess damage; however, at recent disasters, teams collectively delivered between 47GB and 369GB of imagery per day, representing more imagery than can reasonably be transmitted or interpreted by subject matter experts in the disaster scene, thus delaying response efforts. To alleviate this data avalanche encountered in practice, computer vision and machine learning techniques are necessary. While prior work has been deployed to automatically assess damage in satellite imagery, there is no current state of practice for sUAS-based damage assessment systems, as all known work has been confined to academic settings. This work establishes the state of practice via the development and deployment of models for building damage assessment with sUAS imagery. The model development involved training on the largest known dataset of post-disaster sUAS aerial imagery, containing 21,716 building damage labels, and the operational training of 91 disaster practitioners. The best performing model was deployed during the responses to Hurricanes Debby and Helene, where it assessed a combined 415 buildings in approximately 18 minutes. This work contributes documentation of the actual use of AI/ML for damage assessment during a disaster and lessons learned to the benefit of the AI/ML research and user communities.
comment: 6 pages, 4 figures, 1 table. Accepted - In Press, IAAI'26
Computers and Society
☆ Introducing AI to an Online Petition Platform Changed Outputs but not Outcomes
The rapid integration of AI writing tools into online platforms raises critical questions about their impact on content production and outcomes. We leverage a unique natural experiment on Change.org, a leading social advocacy platform, to causally investigate the effects of an in-platform ''write with AI'' tool. To understand the impact of the AI integration, we collected 1.5 million petitions and employed a difference-in-differences analysis. Our findings reveal that in-platform AI access significantly altered the lexical features of petitions and increased petition homogeneity, but did not improve petition outcomes. We confirmed the results in a separate analysis of repeat petition writers who wrote petitions before and after introduction of the AI tool. The results suggest that while AI writing tools can profoundly reshape online content, their practical utility for improving desired outcomes may be less beneficial than anticipated, and introduce unintended consequences like content homogenization.
☆ The Future of Food: How Artificial Intelligence is Transforming Food Manufacturing
Artificial intelligence is accelerating a new era of food innovation, connecting data from farm to consumer to improve formulation, processing, and health outcomes. Recent advances in deep learning, natural language processing, and multi-omics integration make it possible to understand and optimize food systems with unprecedented depth. However, AI adoption across the food sector remains uneven due to heterogeneous datasets, limited model and system interoperability, and a persistent skills gap between data scientists and food domain experts. To address these challenges and advance responsible innovation, the AI Institute for Next Generation Food Systems (AIFS) convened the inaugural AI for Food Product Development Symposium at University of California, Davis, in October 2025. This white paper synthesizes insights from the symposium, organized around five domains where AI can have the greatest near-term impact: supply chain; formulation and processing; consumer insights and sensory prediction; nutrition and health; and education and workforce development. Across the areas, participants emphasized the importance of interoperable data standards, transparent and interpretable models, and cross-sector collaboration to accelerate the translation of AI research into practice. The discussions further highlighted the need for robust digital infrastructure, privacy-preserving data-sharing mechanisms, and interdisciplinary training pathways that integrate AI literacy with domain expertise. Collectively, the priorities outline a roadmap for integrating AI into food manufacturing in ways that enhance innovation, sustainability, and human well-being while ensuring that technological progress remains grounded in ethics, scientific rigor, and societal benefit.
☆ GAEA: Experiences and Lessons Learned from a Country-Scale Environmental Digital Twin
This paper describes the experiences and lessons learned after the deployment of a country-scale environmental digital twin on the island of Cyprus for three years. This digital twin, called GAEA, contains 27 environmental geospatial services and is suitable for urban planners, policymakers, farmers, property owners, real-estate and forestry professionals, as well as insurance companies and banks that have properties in their portfolio. This paper demonstrates the power, potential, current and future challenges of geospatial analytics and environmental digital twins on a large scale.
☆ Freedom of expression and 'right to be forgotten' cases in the Netherlands after Google Spain
Since the Google Spain judgment of the Court of Justice of the European Union, Europeans have, under certain conditions, the right to have search results for their name delisted. This paper examines how the Google Spain judgment has been applied in the Netherlands. Since the Google Spain judgment, Dutch courts have decided on two cases regarding delisting requests. In both cases, the Dutch courts considered freedom of expression aspects of delisting more thoroughly than the Court of Justice. However, the effect of the Google Spain judgment on freedom of expression is difficult to assess, as search engine operators decide about most delisting requests without disclosing much about their decisions.
☆ Access to Personal Data and the Right to Good Governance during Asylum Procedures after the CJEU's YS. and M. and S. judgment
In the YS. and M. and S. judgment, the Court of Justice of the European Union ruled on three procedures in which Dutch judges asked for clarification on the right of asylum seekers to have access to the documents regarding the decision on asylum applications. The judgment is relevant for interpreting the concept of personal data and the scope of the right of access under the Data Protection Directive, and the right to good administration in the EU Charter of Fundamental Rights. At first glance, the judgment seems disappointing from the viewpoint of individual rights. Nevertheless, in our view the judgment provides sufficient grounds for effective access rights to the minutes in future asylum cases.
☆ New Data Security Requirements and the Proceduralization of Mass Surveillance Law after the European Data Retention Case
This paper discusses the regulation of mass metadata surveillance in Europe through the lens of the landmark judgment in which the Court of Justice of the European Union struck down the Data Retention Directive. The controversial directive obliged telecom and Internet access providers in Europe to retain metadata of all their customers for intelligence and law enforcement purposes, for a period of up to two years. In the ruling, the Court declared the directive in violation of the human rights to privacy and data protection. The Court also confirmed that the mere collection of metadata interferes with the human right to privacy. In addition, the Court developed three new criteria for assessing the level of data security required from a human rights perspective: security measures should take into account the risk of unlawful access to data, and the data's quantity and sensitivity. While organizations that campaigned against the directive have welcomed the ruling, we warn for the risk of proceduralization of mass surveillance law. The Court did not fully condemn mass surveillance that relies on metadata, but left open the possibility of mass surveillance if policymakers lay down sufficient procedural safeguards. Such proceduralization brings systematic risks for human rights. Government agencies, with ample resources, can design complicated systems of procedural oversight for mass surveillance - and claim that mass surveillance is lawful, even if it affects millions of innocent people.
☆ AI Fairness Beyond Complete Demographics: Current Achievements and Future Directions ECAI 2025
Fairness in artificial intelligence (AI) has become a growing concern due to discriminatory outcomes in AI-based decision-making systems. While various methods have been proposed to mitigate bias, most rely on complete demographic information, an assumption often impractical due to legal constraints and the risk of reinforcing discrimination. This survey examines fairness in AI when demographics are incomplete, addressing the gap between traditional approaches and real-world challenges. We introduce a novel taxonomy of fairness notions in this setting, clarifying their relationships and distinctions. Additionally, we summarize existing techniques that promote fairness beyond complete demographics and highlight open research questions to encourage further progress in the field.
comment: ECAI 2025
☆ The Last Vote: A Multi-Stakeholder Framework for Language Model Governance NeurIPS 2025
As artificial intelligence systems become increasingly powerful and pervasive, democratic societies face unprecedented challenges in governing these technologies while preserving core democratic values and institutions. This paper presents a comprehensive framework to address the full spectrum of risks that AI poses to democratic societies. Our approach integrates multi-stakeholder participation, civil society engagement, and existing international governance frameworks while introducing novel mechanisms for risk assessment and institutional adaptation. We propose: (1) a seven-category democratic risk taxonomy extending beyond individual-level harms to capture systemic threats, (2) a stakeholder-adaptive Incident Severity Score (ISS) that incorporates diverse perspectives and context-dependent risk factors, and (3) a phased implementation strategy that acknowledges the complex institutional changes required for effective AI governance.
comment: This paper has been accepted to the NeurIPS 2025 Workshop on Algorithmic Collective Action (ACA@NeurIPS 2025). The submission is 26 pages including the appendix and includes the NeurIPS checklist. A big thanks to Avijit Ghosh
☆ Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment AAAI 2026
Humans display significant uncertainty when confronted with moral dilemmas, yet the extent of such uncertainty in machines and AI agents remains underexplored. Recent studies have confirmed the overly confident tendencies of machine-generated responses, particularly in large language models (LLMs). As these systems are increasingly embedded in ethical decision-making scenarios, it is important to understand their moral reasoning and the inherent uncertainties in building reliable AI systems. This work examines how uncertainty influences moral decisions in the classical trolley problem, analyzing responses from 32 open-source models and 9 distinct moral dimensions. We first find that variance in model confidence is greater across models than within moral dimensions, suggesting that moral uncertainty is predominantly shaped by model architecture and training method. To quantify uncertainty, we measure binary entropy as a linear combination of total entropy, conditional entropy, and mutual information. To examine its effects, we introduce stochasticity into models via "dropout" at inference time. Our findings show that our mechanism increases total entropy, mainly through a rise in mutual information, while conditional entropy remains largely unchanged. Moreover, this mechanism significantly improves human-LLM moral alignment, with correlations in mutual information and alignment score shifts. Our results highlight the potential to better align model-generated decisions and human preferences by deliberately modulating uncertainty and reducing LLMs' confidence in morally complex scenarios.
comment: Accepted to AAAI 2026
☆ Computational Measurement of Political Positions: A Review of Text-Based Ideal Point Estimation Algorithms
This article presents the first systematic review of unsupervised and semi-supervised computational text-based ideal point estimation (CT-IPE) algorithms, methods designed to infer latent political positions from textual data. These algorithms are widely used in political science, communication, computational social science, and computer science to estimate ideological preferences from parliamentary speeches, party manifestos, and social media. Over the past two decades, their development has closely followed broader NLP trends -- beginning with word-frequency models and most recently turning to large language models (LLMs). While this trajectory has greatly expanded the methodological toolkit, it has also produced a fragmented field that lacks systematic comparison and clear guidance for applied use. To address this gap, we identified 25 CT-IPE algorithms through a systematic literature review and conducted a manual content analysis of their modeling assumptions and development contexts. To compare them meaningfully, we introduce a conceptual framework that distinguishes how algorithms generate, capture, and aggregate textual variance. On this basis, we identify four methodological families -- word-frequency, topic modeling, word embedding, and LLM-based approaches -- and critically assess their assumptions, interpretability, scalability, and limitations. Our review offers three contributions. First, it provides a structured synthesis of two decades of algorithm development, clarifying how diverse methods relate to one another. Second, it translates these insights into practical guidance for applied researchers, highlighting trade-offs in transparency, technical requirements, and validation strategies that shape algorithm choice. Third, it emphasizes that differences in estimation outcomes across algorithms are themselves informative, underscoring the need for systematic benchmarking.
comment: 46 pages, 8 figures, 2 tables, accepted for publication in Quality & Quantity
♻ ☆ Theories of "Sexuality" in Natural Language Processing Bias Research
In recent years, significant advancements in the field of Natural Language Processing (NLP) have positioned commercialized language models as wide-reaching, highly useful tools. In tandem, there has been an explosion of multidisciplinary research examining how NLP tasks reflect, perpetuate, and amplify social biases such as gender and racial bias. A significant gap in this scholarship is a detailed analysis of how queer sexualities are encoded and (mis)represented by both NLP systems and practitioners. Following previous work in the field of AI fairness, we document how sexuality is defined and operationalized via a survey and analysis of 55 articles that quantify sexuality-based NLP bias. We find that sexuality is not clearly defined in a majority of the literature surveyed, indicating a reliance on assumed or normative conceptions of sexual/romantic practices and identities. Further, we find that methods for extracting biased outputs from NLP technologies often conflate gender and sexual identities, leading to monolithic conceptions of queerness and thus improper quantifications of bias. With the goal of improving sexuality-based NLP bias analyses, we conclude with recommendations that encourage more thorough engagement with both queer communities and interdisciplinary literature.
comment: 17 pages, 6 tables, 1 figure, undergraduate senior thesis, submitted to The Spectra: The Virginia Engineering and Science Research Journal
♻ ☆ Generative AI for Multiple Choice STEM Assessments
Artificial intelligence (AI) technology enables a range of enhancements in computer-aided instruction, from accelerating the creation of teaching materials to customizing learning paths based on learner outcomes. However, ensuring the mathematical accuracy and semantic integrity of generative AI output remains a significant challenge, particularly in Science, Technology, Engineering and Mathematics (STEM) disciplines. In this study, we explore the use of generative AI in which "hallucinations", typically viewed as undesirable inaccuracies, can instead serve a pedagogical purpose. Specifically, we investigate the generation of plausible but incorrect alternatives for multiple choice assessments, where credible distractors are essential for effective assessment design. We describe the Moebius platform for online instruction, with particular focus on its architecture for handling mathematical elements through specialized semantic packages that support dynamic, parameterized STEM content. We examine methods for crafting prompts that interact effectively with these mathematical semantics to guide the AI in generating high-quality multiple choice distractors. Finally, we demonstrate how this approach reduces the time and effort associated with creating robust teaching materials while maintaining academic rigor and assessment validity.
♻ ☆ Can Machines Think Like Humans? A Behavioral Evaluation of LLM Agents in Dictator Games
As Large Language Model (LLM)-based agents increasingly engage with human society, how well do we understand their prosocial behaviors? We (1) investigate how LLM agents' prosocial behaviors can be induced by different personas and benchmarked against human behaviors; and (2) introduce a social science approach to evaluate LLM agents' decision-making. We explored how different personas and experimental framings affect these AI agents' altruistic behavior in dictator games and compared their behaviors within the same LLM family, across various families, and with human behaviors. The findings reveal that merely assigning a human-like identity to LLMs does not produce human-like behaviors. These findings suggest that LLM agents' reasoning does not consistently exhibit textual markers of human decision-making in dictator games and that their alignment with human behavior varies substantially across model architectures and prompt formulations; even worse, such dependence does not follow a clear pattern. As society increasingly integrates machine intelligence, "Prosocial AI" emerges as a promising and urgent research direction in philanthropic studies.
♻ ☆ LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning
Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance, for example, search improves Gemini's accuracy by +13.6%, but reduces GPT-series performance by -11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems: capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.
♻ ☆ Optimizing Urban Service Allocation with Time-Constrained Restless Bandits
Municipal inspections are an important part of maintaining the quality of goods and services. In this paper, we approach the problem of intelligently scheduling service inspections to maximize their impact, using the case of food establishment inspections in Chicago as a case study. The Chicago Department of Public Health (CDPH) inspects thousands of establishments each year, with a substantial fail rate (over 3,000 failed inspection reports in 2023). To balance the objectives of ensuring adherence to guidelines, minimizing disruption to establishments, and minimizing inspection costs, CDPH assigns each establishment an inspection window every year and guarantees that they will be inspected exactly once during that window. Meanwhile, CDPH also promises surprise public health inspections for unexpected food safety emergencies or complaints. These constraints create a challenge for a restless multi-armed bandit (RMAB) approach, for which there are no existing methods. We develop an extension to Whittle index-based systems for RMABs that can guarantee action window constraints and frequencies, and furthermore can be leveraged to optimize action window assignments themselves. Briefly, we combine MDP reformulation and integer programming-based lookahead to maximize the impact of inspections subject to constraints. A neural network-based supervised learning model is developed to model state transitions of real Chicago establishments using public CDPH inspection records, which demonstrates 10% AUC improvements compared with directly predicting establishments' failures. Our experiments not only show up to 24% (in simulation) or 33% (on real data) objective improvements resulting from our approach and robustness to surprise inspections, but also give insight into the impact of scheduling constraints.
♻ ☆ Ken Utilization Layer: Hebbian Replay Within a Student's Ken for Adaptive Exercise Recommendation
Adaptive exercise recommendation (ER) aims to choose the next activity that matches a learner's evolving Zone of Proximal Development (ZPD). We present KUL-Rec, a biologically inspired ER system that couples a fast Hebbian memory with slow replay-based consolidation to enable continual, few-shot personalization from sparse interactions. The model operates in an embedding space, allowing a single architecture to handle both tabular knowledge-tracing logs and open-ended short-answer text. We align evaluation with tutoring needs using bidirectional ranking and rank-sensitive metrics (nDCG, Recall@K). Across ten public datasets, KUL-Rec improves macro nDCG (0.316 vs. 0.265 for the strongest baseline) and Recall@10 (0.305 vs. 0.211), while achieving low inference latency and an $\approx99$\% reduction in peak GPU memory relative to a competitive graph-based model. In a 13-week graduate course, KUL-Rec personalized weekly short-answer quizzes generated by a retrieval-augmented pipeline and the personalized quizzes were associated with lower perceived difficulty and higher helpfulness (p < .05). An embedding robustness audit highlights that encoder choice affects semantic alignment, motivating routine audits when deploying open-response assessment. Together, these results indicate that Hebbian replay with bounded consolidation offers a practical path to real-time, interpretable ER that scales across data modalities and classroom settings.
♻ ☆ A GPU-Accelerated RAG-Based Telegram Assistant for Supporting Parallel Processing Students
This project addresses a critical pedagogical need: offering students continuous, on-demand academic assistance beyond conventional reception hours. I present a domain-specific Retrieval-Augmented Generation (RAG) system powered by a quantized Mistral-7B Instruct model and deployed as a Telegram bot. The assistant enhances learning by delivering real-time, personalized responses aligned with the "Introduction to Parallel Processing" course materials. GPU acceleration significantly improves inference latency, enabling practical deployment on consumer hardware. This approach demonstrates how consumer GPUs can enable affordable, private, and effective AI tutoring for HPC education.
comment: 9 pages
♻ ☆ Reinforcing Trustworthiness in Multimodal Emotional Support Systems
In today's world, emotional support is increasingly essential, yet it remains challenging for both those seeking help and those offering it. Multimodal approaches to emotional support show great promise by integrating diverse data sources to provide empathetic, contextually relevant responses, fostering more effective interactions. However, current methods have notable limitations, often relying solely on text or converting other data types into text, or providing emotion recognition only, thus overlooking the full potential of multimodal inputs. Moreover, many studies prioritize response generation without accurately identifying critical emotional support elements or ensuring the reliability of outputs. To overcome these issues, we introduce \textsc{ MultiMood}, a new framework that (i) leverages multimodal embeddings from video, audio, and text to predict emotional components and to produce responses responses aligned with professional therapeutic standards. To improve trustworthiness, we (ii) incorporate novel psychological criteria and apply Reinforcement Learning (RL) to optimize large language models (LLMs) for consistent adherence to these standards. We also (iii) analyze several advanced LLMs to assess their multimodal emotional support capabilities. Experimental results show that MultiMood achieves state-of-the-art on MESC and DFEW datasets while RL-driven trustworthiness improvements are validated through human and LLM evaluations, demonstrating its superior capability in applying a multimodal framework in this domain.
♻ ☆ EXAGREE: Mitigating Explanation Disagreement with Stakeholder-Aligned Models
Conflicting explanations, arising from different attribution methods or model internals, limit the adoption of machine learning models in safety-critical domains. We turn this disagreement into an advantage and introduce EXplanation AGREEment (EXAGREE), a two-stage framework that selects a Stakeholder-Aligned Explanation Model (SAEM) from a set of similar-performing models. The selection maximizes Stakeholder-Machine Agreement (SMA), a single metric that unifies faithfulness and plausibility. EXAGREE couples a differentiable mask-based attribution network (DMAN) with monotone differentiable sorting, enabling gradient-based search inside the constrained model space. Experiments on six real-world datasets demonstrate simultaneous gains of faithfulness, plausibility, and fairness over baselines, while preserving task accuracy. Extensive ablation studies, significance tests, and case studies confirm the robustness and feasibility of the method in practice.
♻ ☆ Human-Centered Development of Indicators for Self-Service Learning Analytics: A Transparency through Exploration Approach
The aim of learning analytics is to turn educational data into insights, decisions, and actions to improve learning and teaching. The reasoning of the provided insights, decisions, and actions is often not transparent to the end-user, and this can lead to trust and acceptance issues when interventions, feedback, and recommendations fail. In this paper, we shed light on achieving transparent learning analytics by following a transparency through exploration approach. To this end, we present the design, implementation, and evaluation details of the Indicator Editor, which aims to support self-service learning analytics (SSLA) by empowering end-users to take control of the indicator implementation process. We systematically designed and implemented the Indicator Editor through an iterative human-centered design (HCD) approach. Further, we conducted a qualitative user study (n=15) to investigate the impact of following an SSLA approach on the users' perception of and interaction with the Indicator Editor. Our study showed qualitative evidence that supporting user interaction and providing user control in the indicator implementation process can have positive effects on different crucial aspects of learning analytics, namely transparency, trust, satisfaction, and acceptance.
comment: Submitted to JLA - revised version
♻ ☆ From Model Training to Model Raising
Current AI training methods align models with human values only after their core capabilities have been established, resulting in models that are easily misaligned and lack deep-rooted value systems. We propose a paradigm shift from "model training" to "model raising", in which alignment is woven into a model's development from the start. We identify several key components for this paradigm, all centered around redesigning the training corpus: reframing training data from a first-person perspective, recontextualizing information as lived experience, simulating social interactions, and scaffolding the ordering of training data. We expect that this redesign of the training corpus will lead to an early commitment to values from the first training token onward, such that knowledge, skills, and values are intrinsically much harder to separate. In an ecosystem in which large language model capabilities start overtaking human capabilities in many tasks, this seems to us like a critical need.
comment: Accepted for publication in Communications of the ACM (CACM), Opinion section
♻ ☆ The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1
The rapid development of large reasoning models (LRMs), such as OpenAI-o3 and DeepSeek-R1, has led to significant improvements in complex reasoning over non-reasoning large language models~(LLMs). However, their enhanced capabilities, combined with the open-source access of models like DeepSeek-R1, raise serious safety concerns, particularly regarding their potential for misuse. In this work, we present a comprehensive safety assessment of these reasoning models, leveraging established safety benchmarks to evaluate their compliance with safety regulations. Furthermore, we investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications. Through our multi-faceted analysis, we uncover four key findings: (1) There is a significant safety gap between the open-source reasoning models and the o3-mini model, on both safety benchmark and attack, suggesting more safety effort on open LRMs is needed. (2) The stronger the model's reasoning ability, the greater the potential harm it may cause when answering unsafe questions. (3) Safety thinking emerges in the reasoning process of LRMs, but fails frequently against adversarial attacks. (4) The thinking process in R1 models poses greater safety concerns than their final answers. Our study provides insights into the security implications of reasoning models and highlights the need for further advancements in R1 models' safety to close the gap.