Computation and Language
☆ Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Large Vision-Language Models (VLMs) have achieved remarkable progress in
multimodal understanding, yet they struggle when reasoning over
information-intensive images that densely interleave textual annotations with
fine-grained graphical elements. The main challenges lie in precisely
localizing critical cues in dense layouts and multi-hop reasoning to integrate
dispersed evidence. We propose Speculative Verdict (SV), a training-free
framework inspired by speculative decoding that combines multiple lightweight
draft experts with a large verdict model. In the draft stage, small VLMs act as
draft experts to generate reasoning paths that provide diverse localization
candidates; in the verdict stage, a strong VLM synthesizes these paths to
produce the final answer, minimizing computational cost while recovering
correct answers. To further improve efficiency and accuracy, SV introduces a
consensus expert selection mechanism that forwards only high-agreement
reasoning paths to the verdict. Empirically, SV achieves consistent gains on
challenging information-intensive and high-resolution visual question answering
benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K.
By synthesizing correct insights from multiple partially accurate reasoning
paths, SV achieves both error correction and cost-efficiency compared to large
proprietary models or training pipelines. Code is available at
https://github.com/Tinaliu0123/speculative-verdict
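
A minimal sketch of the draft-then-verdict flow described above, assuming hypothetical `draft_model.answer()` and `verdict_model.synthesize()` interfaces; consensus expert selection here keeps only reasoning paths whose final answers agree with the plurality answer, which is one plausible reading of the abstract rather than the authors' exact rule.

```python
from collections import Counter

def speculative_verdict(question, image, draft_models, verdict_model, min_agreement=2):
    """Draft stage: small VLMs propose reasoning paths; verdict stage: a large
    VLM synthesizes the high-agreement paths into a final answer.
    draft_model.answer / verdict_model.synthesize are hypothetical interfaces."""
    # Draft stage: each lightweight expert produces (reasoning_path, short_answer).
    drafts = [m.answer(question, image) for m in draft_models]

    # Consensus expert selection: keep paths whose answers agree with the plurality.
    counts = Counter(ans for _, ans in drafts)
    top_answer, top_count = counts.most_common(1)[0]
    selected = [path for path, ans in drafts if ans == top_answer]
    if top_count < min_agreement:
        selected = [path for path, _ in drafts]  # fall back to all paths

    # Verdict stage: the strong VLM reads only the selected paths.
    return verdict_model.synthesize(question, image, selected)
```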
☆ On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text?
With the widespread use of large language models (LLMs), many researchers
have turned their attention to detecting text generated by them. However, there
is no consistent or precise definition of their target, namely "LLM-generated
text". Differences in usage scenarios and the diversity of LLMs further
increase the difficulty of detection. What is commonly regarded as the
detection target usually represents only a subset of the text that LLMs can
potentially produce. Human edits to LLM outputs, together with the subtle
influences that LLMs exert on their users, are blurring the line between
LLM-generated and human-written text. Existing benchmarks and evaluation
approaches do not adequately address the various conditions in real-world
detector applications. Hence, the numerical results of detectors are often
misunderstood, and their significance is diminishing. Detectors remain useful
under specific conditions, but their results should be interpreted only as
references rather than decisive indicators.
☆ Real Deep Research for AI, Robotics and Beyond
Xueyan Zou, Jianglong Ye, Hao Zhang, Xiaoyu Xiang, Mingyu Ding, Zhaojing Yang, Yong Jae Lee, Zhuowen Tu, Sifei Liu, Xiaolong Wang
With the rapid growth of research in AI and robotics, now producing over
10,000 papers annually, it has become increasingly difficult for researchers to
stay up to date. Fast-evolving trends, the rise of interdisciplinary work, and
the need to explore domains beyond one's expertise all contribute to this
challenge. To address these issues, we propose a generalizable pipeline capable
of systematically analyzing any research area: identifying emerging trends,
uncovering cross domain opportunities, and offering concrete starting points
for new inquiry. In this work, we present Real Deep Research (RDR), a
comprehensive framework applied to the domains of AI and robotics, with a
particular focus on foundation models and robotics advancements. We also
briefly extend our analysis to other areas of science. The main paper details
the construction of the RDR pipeline, while the appendix provides extensive
results across each analyzed topic. We hope this work provides useful guidance
for researchers working in the field of AI and beyond.
comment: website: https://realdeepresearch.github.io
☆ Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples
Recently, Sharma et al. suggested a method called Layer-SElective-Rank
reduction (LASER), which demonstrated that pruning high-order components of
carefully chosen LLM weight matrices can boost downstream accuracy -- without
any gradient-based fine-tuning. Yet LASER's exhaustive, per-matrix search (each
requiring full-dataset forward passes) makes it impractical for rapid
deployment. We demonstrate that this overhead can be removed and find that: (i)
Only a small, carefully chosen subset of matrices needs to be inspected --
eliminating the layer-by-layer sweep, (ii) The gradient of each matrix's
singular values pinpoints which matrices merit reduction, (iii) Increasing the
factorization search space by allowing matrix rows to cluster around multiple
subspaces and then decomposing each cluster separately further reduces
overfitting on the original training data and further lifts accuracy by up to
24.6 percentage points, and finally, (iv) we discover that evaluating on just
100 samples rather than the full training data -- both for computing the
indicative gradients and for measuring the final accuracy -- suffices to
further reduce the search time; we attribute this to the fact that adaptation
to downstream tasks is dominated by prompting style, not dataset size. As a result, we show
that combining these findings yields a fast and robust adaptation algorithm for
downstream tasks. Overall, with a single gradient step on 100 examples and a
quick scan of the top candidate layers and factorization techniques, we can
adapt LLMs to new datasets -- entirely without fine-tuning.
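
The gist of the procedure, as a hedged NumPy sketch (not the authors' code): rank-reduce a weight matrix by truncating its SVD, and use the gradient of the loss with respect to each matrix's singular values (passed in precomputed here, from a single gradient step on ~100 samples) to decide which matrices merit reduction. Thresholds and ratios are illustrative.

```python
import numpy as np

def low_rank_approx(W, keep_ratio=0.25):
    """Replace W with a truncated-SVD reconstruction, keeping only the top
    singular directions (LASER-style pruning of high-order components)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    k = max(1, int(keep_ratio * len(S)))
    return (U[:, :k] * S[:k]) @ Vt[:k, :]

def select_and_reduce(weights, sv_grads, top_n=3, keep_ratio=0.25):
    """weights: {name: matrix}; sv_grads: {name: gradient of the loss w.r.t.
    the singular values of that matrix}. Matrices with the largest
    singular-value gradients are reduced first (an assumption matching the
    abstract's claim, not the exact published criterion)."""
    ranked = sorted(weights, key=lambda n: np.abs(sv_grads[n]).sum(), reverse=True)
    reduced = dict(weights)
    for name in ranked[:top_n]:
        reduced[name] = low_rank_approx(weights[name], keep_ratio)
    return reduced
```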
☆ Simple Context Compression: Mean-Pooling and Multi-Ratio Training
A common strategy to reduce the computational costs of using long contexts in
retrieval-augmented generation (RAG) with large language models (LLMs) is soft
context compression, where the input sequence is transformed into a shorter
continuous representation. We develop a lightweight and simple mean-pooling
approach that consistently outperforms the widely used compression-tokens
architecture, and study training the same compressor to output multiple
compression ratios. We conduct extensive experiments across in-domain and
out-of-domain QA datasets, as well as across model families, scales, and
compression ratios. Overall, our simple mean-pooling approach achieves the
strongest performance, with a relatively small drop when training for multiple
compression ratios. More broadly, though, the trade-offs across architectures
and training regimes are more nuanced, illustrating the complex landscape of
compression methods.
comment: Code available at
https://github.com/lil-lab/simple-context-compression
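
A toy illustration of the mean-pooling compressor (a sketch, not the released code): token-level hidden states of the retrieved context are averaged in contiguous groups whose size is set by the compression ratio, yielding a shorter sequence of continuous vectors to feed to the LLM.

```python
import torch

def mean_pool_compress(hidden_states: torch.Tensor, ratio: int) -> torch.Tensor:
    """hidden_states: (seq_len, dim) token representations of the context.
    Returns roughly seq_len / ratio vectors by averaging contiguous chunks.
    Zero-padding slightly dilutes the last chunk; acceptable for a sketch."""
    seq_len, dim = hidden_states.shape
    pad = (-seq_len) % ratio                      # pad so seq_len divides evenly
    if pad:
        hidden_states = torch.cat(
            [hidden_states, hidden_states.new_zeros(pad, dim)], dim=0)
    chunks = hidden_states.view(-1, ratio, dim)   # (seq_len/ratio, ratio, dim)
    return chunks.mean(dim=1)                     # (seq_len/ratio, dim)

# Multi-ratio training would call this with a different `ratio` per batch.
```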
☆ BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation
The rapid progress of graph generation has raised new security concerns,
particularly regarding backdoor vulnerabilities. While prior work has explored
backdoor attacks in image diffusion and unconditional graph generation,
conditional generation, especially text-guided graph generation, remains largely
unexamined. This paper proposes BadGraph, a backdoor attack method targeting
latent diffusion models for text-guided graph generation. BadGraph leverages
textual triggers to poison training data, covertly implanting backdoors that
induce attacker-specified subgraphs during inference when triggers appear,
while preserving normal performance on clean inputs. Extensive experiments on
four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the
effectiveness and stealth of the attack: a poisoning rate of less than 10% can
achieve a 50% attack success rate, while 24% suffices for a success rate of over 80%,
with negligible performance degradation on benign samples. Ablation studies
further reveal that the backdoor is implanted during VAE and diffusion training
rather than pretraining. These findings reveal security vulnerabilities in
latent diffusion models for text-guided graph generation, highlight the serious
risks in applications such as drug discovery, and underscore the need
for robust defenses against backdoor attacks in such diffusion models.
☆ Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
Linear-attention models that compress the entire input sequence into a
fixed-size recurrent state offer an efficient alternative to Transformers, but
their finite memory induces forgetfulness that harms retrieval-intensive tasks.
To mitigate the issue, we explore a series of hybrid models that restore direct
access to past tokens. We interleave token mixers with intermediate time and
space complexity between linear and full attention, including sparse attention
with token eviction and query-aware native sparse attention. In particular,
we propose a novel learnable token eviction approach. Combined with
sliding-window attention, an end-to-end trainable lightweight CNN aggregates
information from both past and future adjacent tokens to adaptively retain a
limited set of critical KV-pairs per head, maintaining linear attention's
constant time and space complexity. Efficient Triton kernels for the sparse
attention mechanisms are provided. Empirical evaluations on retrieval-intensive
benchmarks support the effectiveness of our approaches.
comment: 19 pages, 5 figures
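
A hedged PyTorch sketch of the learnable-eviction idea: a small 1-D convolution over the token representations scores each position using both past and future neighbors, and only the top-k KV pairs per head are retained alongside a sliding window. Module layout, dimensions, and the scoring head are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LearnableEviction(nn.Module):
    """Scores tokens with a lightweight CNN (seeing adjacent past/future
    tokens) and keeps a fixed budget of KV pairs per head, preserving
    constant memory."""
    def __init__(self, dim, num_heads, kernel_size=5):
        super().__init__()
        self.score = nn.Conv1d(dim, num_heads, kernel_size, padding=kernel_size // 2)

    def forward(self, x, keys, values, budget=128):
        # x: (batch, seq, dim); keys/values: (batch, heads, seq, head_dim)
        scores = self.score(x.transpose(1, 2))             # (batch, heads, seq)
        keep = scores.topk(min(budget, scores.size(-1)), dim=-1).indices
        keep, _ = keep.sort(dim=-1)                        # keep temporal order
        idx = keep.unsqueeze(-1).expand(-1, -1, -1, keys.size(-1))
        return keys.gather(2, idx), values.gather(2, idx)  # evicted caches
```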
☆ A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text CIKM '25
Current methods for evaluating large language models (LLMs) typically focus
on high-level tasks such as text generation, without targeting a particular AI
application. This approach is not sufficient for evaluating LLMs for
Responsible AI dimensions like fairness, since protected attributes that are
highly relevant in one application may be less relevant in another. In this
work, we construct a dataset that is driven by a real-world application
(generate a plain-text product description, given a list of product features),
parameterized by fairness attributes intersected with gendered adjectives and
product categories, yielding a rich set of labeled prompts. We show how to use
the data to identify quality, veracity, safety, and fairness gaps in LLMs,
contributing a proposal for LLM evaluation paired with a concrete resource for
the research community.
comment: 24 pages with 3 figures, to appear in Proceedings of the 34th ACM
International Conference on Information and Knowledge Management (CIKM '25)
☆ Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost NeurIPS 2025
Recent advancements in large reasoning models (LRMs) have introduced an
intermediate "thinking" process prior to generating final answers, improving
their reasoning capabilities on complex downstream tasks. However, the
potential of LRMs as evaluators for machine translation (MT) quality remains
underexplored. We provide the first systematic analysis of LRM-as-a-judge in
MT evaluation. We identify key challenges, revealing that LRMs require tailored
evaluation materials, tend to "overthink" simpler instances, and have issues
with scoring mechanisms that lead to overestimation. To address these, we propose
to calibrate LRM thinking by training them on synthetic, human-like thinking
trajectories. Our experiments on WMT24 Metrics benchmarks demonstrate that this
approach reduces thinking budgets by roughly 35x while concurrently improving
evaluation performance across different LRM scales from 7B to 32B (e.g.,
R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These
findings highlight the potential of efficiently calibrated LRMs to advance
fine-grained automatic MT evaluation.
comment: NeurIPS 2025
☆ Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations
Lorenzo Stacchio, Andrea Ubaldi, Alessandro Galdelli, Maurizio Mauri, Emanuele Frontoni, Andrea Gaggioli
We present Empathic Prompting, a novel framework for multimodal human-AI
interaction that enriches Large Language Model (LLM) conversations with
implicit non-verbal context. The system integrates a commercial facial
expression recognition service to capture users' emotional cues and embeds them
as contextual signals during prompting. Unlike traditional multimodal
interfaces, empathic prompting requires no explicit user control; instead, it
unobtrusively augments textual input with affective information to support
conversational smoothness and alignment. The architecture is modular and
scalable, allowing integration of additional non-verbal modules. We describe
the system design, implemented through a locally deployed DeepSeek instance,
and report a preliminary service and usability evaluation (N=5). Results show
consistent integration of non-verbal input into coherent LLM outputs, with
participants highlighting conversational fluidity. Beyond this proof of
concept, empathic prompting points to applications in chatbot-mediated
communication, particularly in domains like healthcare or education, where
users' emotional signals are critical yet often opaque in verbal exchanges.
☆ Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems
We present a multi-agent, human-in-the-loop workflow that co-designs quantum
codes with prescribed transversal diagonal gates. It builds on the Subset-Sum
Linear Programming (SSLP) framework (arXiv:2504.20847), which partitions basis
strings by modular residues and enforces $Z$-marginal Knill-Laflamme (KL)
equalities via small LPs. The workflow is powered by GPT-5 and implemented
within TeXRA (https://texra.ai), a multi-agent research assistant platform that
supports an iterative tool-use loop agent and a derivation-then-edit workflow
reasoning agent. We work in a LaTeX-Python environment where agents reason,
edit documents, execute code, and synchronize their work to Git/Overleaf.
Within this workspace, three roles collaborate: a Synthesis Agent formulates
the problem; a Search Agent sweeps/screens candidates and exactifies numerics
into rationals; and an Audit Agent independently checks all KL equalities and
the induced logical action. As a first step we focus on distance $d=2$ with
nondegenerate residues. For code dimension $K\in\{2,3,4\}$ and $n\le6$ qubits,
systematic sweeps yield certificate-backed tables cataloging attainable cyclic
logical groups, all realized by new codes; e.g., for $K=3$ we obtain order $16$
at $n=6$. From verified instances, Synthesis Agent abstracts recurring
structures into closed-form families and proves they satisfy the KL equalities
for all parameters. It further demonstrates that SSLP accommodates residue
degeneracy by exhibiting a new $((6,4,2))$ code implementing the transversal
controlled-phase $\mathrm{diag}(1,1,1,i)$. Overall, the workflow recasts
diagonal-transversal feasibility as an analytical pipeline executed at scale,
combining systematic enumeration with exact analytical reconstruction. It
yields reproducible code constructions, supports targeted extensions to larger
$K$ and higher distances, and leads toward data-driven classification.
comment: 29 pages, 2 figures
☆ Automated Extraction of Fluoropyrimidine Treatment and Treatment-Related Toxicities from Clinical Notes Using Natural Language Processing
Objective: Fluoropyrimidines are widely prescribed for colorectal and breast
cancers, but are associated with toxicities such as hand-foot syndrome and
cardiotoxicity. Since toxicity documentation is often embedded in clinical
notes, we aimed to develop and evaluate natural language processing (NLP)
methods to extract treatment and toxicity information.
Materials and Methods: We constructed a gold-standard dataset of 236 clinical
notes from 204,165 adult oncology patients. Domain experts annotated categories
related to treatment regimens and toxicities. We developed rule-based, machine
learning-based (Random Forest, Support Vector Machine [SVM], Logistic
Regression [LR]), deep learning-based (BERT, ClinicalBERT), and large language
models (LLM)-based NLP approaches (zero-shot and error-analysis prompting).
Models used an 80:20 train-test split.
Results: Sufficient data existed to train and evaluate 5 annotated
categories. Error-analysis prompting achieved optimal precision, recall, and F1
scores (F1=1.000) for treatment and toxicities extraction, whereas zero-shot
prompting reached F1=1.000 for treatment and F1=0.876 for toxicities
extraction. LR and SVM ranked second for toxicities (F1=0.937). Deep learning
underperformed, with BERT (F1=0.873 treatment; F1=0.839 toxicities) and
ClinicalBERT (F1=0.873 treatment; F1=0.886 toxicities). Rule-based methods
served as our baseline with F1 scores of 0.857 in treatment and 0.858 in
toxicities.
Discussion: LLM-based approaches outperformed all others, followed by machine
learning methods. Machine and deep learning approaches were limited by small
training data and showed limited generalizability, particularly for rare
categories.
Conclusion: LLM-based NLP most effectively extracted fluoropyrimidine
treatment and toxicity information from clinical notes, and has strong
potential to support oncology research and pharmacovigilance.
☆ User Perceptions of Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios
Large language models (LLMs) have seen rapid adoption for tasks such as
drafting emails, summarizing meetings, and answering health questions. In such
uses, users may need to share private information (e.g., health records,
contact details). To evaluate LLMs' ability to identify and redact such private
information, prior work developed benchmarks (e.g., ConfAIde, PrivacyLens) with
real-life scenarios. Using these benchmarks, researchers have found that LLMs
sometimes fail to keep secrets private when responding to complex tasks (e.g.,
leaking employee salaries in meeting summaries). However, these evaluations
rely on LLMs (proxy LLMs) to gauge compliance with privacy norms, overlooking
real users' perceptions. Moreover, prior work primarily focused on the
privacy-preservation quality of responses, without investigating nuanced
differences in helpfulness. To understand how users perceive the
privacy-preservation quality and helpfulness of LLM responses to
privacy-sensitive scenarios, we conducted a user study with 94 participants
using 90 scenarios from PrivacyLens. We found that, when evaluating identical
responses to the same scenario, users showed low agreement with each other on
the privacy-preservation quality and helpfulness of the LLM response. Further,
we found high agreement among five proxy LLMs, while each individual LLM had
low correlation with users' evaluations. These results indicate that the
privacy and helpfulness of LLM responses are often specific to individuals, and
proxy LLMs are poor estimates of how real users would perceive these responses
in privacy-sensitive scenarios. Our results suggest the need to conduct
user-centered studies on measuring LLMs' ability to help users while preserving
privacy. Additionally, future research could investigate ways to improve the
alignment between proxy LLMs and users for better estimation of users'
perceived privacy and utility.
☆ Structure-Conditional Minimum Bayes Risk Decoding EMNLP 2025
Minimum Bayes Risk (MBR) decoding has seen renewed interest as an alternative
to traditional generation strategies. While MBR has proven effective in machine
translation, where the variability of a language model's outcome space is
naturally constrained, it may face challenges in more open-ended tasks such as
dialogue or instruction-following. We hypothesise that in such settings,
applying MBR with standard similarity-based utility functions may result in
selecting responses that are broadly representative of the model's
distribution, yet sub-optimal with respect to any particular grouping of
generations that share an underlying latent structure. In this work, we
introduce three lightweight adaptations to the utility function, designed to
make MBR more sensitive to structural variability in the outcome space. To test
our hypothesis, we curate a dataset capturing three representative types of
latent structure: dialogue act, emotion, and response structure (e.g., a
sentence, a paragraph, or a list). We further propose two metrics to evaluate
the structural optimality of MBR. Our analysis demonstrates that common
similarity-based utility functions fall short by these metrics. In contrast,
our proposed adaptations considerably improve structural optimality. Finally,
we evaluate our approaches on real-world instruction-following benchmarks,
AlpacaEval and MT-Bench, and show that increased structural sensitivity
improves generation quality by up to 13.7 percentage points in win rate.
comment: EMNLP 2025 Camera-Ready
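
To make the setup concrete, here is a generic MBR selector together with one hedged sketch of a structure-conditional adaptation: utility is computed only against candidates sharing the same latent structure, and the answer is chosen from the largest structural group. The grouping function and similarity metric are placeholders, not the paper's exact adaptations.

```python
from collections import defaultdict

def mbr_select(candidates, similarity):
    """Standard MBR: pick the candidate with the highest average similarity
    to the other samples (its expected utility under the model distribution)."""
    def expected_utility(c):
        others = [o for o in candidates if o is not c]
        return sum(similarity(c, o) for o in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)

def structure_conditional_mbr(candidates, similarity, structure_of):
    """Sketch of a structure-conditional variant: group candidates by latent
    structure (e.g., dialogue act, emotion, or response format) and run MBR
    only within the largest group."""
    groups = defaultdict(list)
    for c in candidates:
        groups[structure_of(c)].append(c)
    largest = max(groups.values(), key=len)
    return mbr_select(largest, similarity)
```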
☆ Neural Diversity Regularizes Hallucinations in Small Models
Language models continue to hallucinate despite increases in parameters,
compute, and data. We propose neural diversity -- decorrelated parallel
representations -- as a principled mechanism that reduces hallucination rates
at fixed parameter and data budgets. Inspired by portfolio theory, where
uncorrelated assets reduce risk by $\sqrt{P}$, we prove hallucination
probability is bounded by representational correlation: $P(H) \leq
f(\sigma^2((1-\rho(P))/P + \rho(P)), \mu^2)$, which predicts that language
models need an optimal amount of neurodiversity. To validate this, we introduce
ND-LoRA (Neural Diversity Low-Rank Adaptation), combining parallel LoRA
adapters with Barlow Twins regularization, and demonstrate that ND-LoRA reduces
hallucinations by up to 25.6% (and 14.6% on average) without degrading general
accuracy. Ablations show that LoRA adapters and regularization act synergistically,
causal interventions establish neurodiversity as the mediating factor, and
correlational analyses indicate the scale of the effect: a 0.1% increase in neural
correlation is associated with a 3.8% increase in hallucination. Finally, task-dependent
optimality emerges: different tasks require different amounts of optimal
neurodiversity. Together, our results highlight neural diversity as a third
axis of scaling -- orthogonal to parameters and data -- to improve the
reliability of language models at fixed budgets.
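
A hedged sketch of the regularization ingredient named above: a Barlow Twins-style penalty that decorrelates the outputs of two parallel LoRA adapters by driving their cross-correlation matrix toward the identity. The loss weighting and the way adapter hidden states are obtained are assumptions, not the authors' exact objective.

```python
import torch

def barlow_twins_penalty(z1, z2, off_diag_weight=5e-3):
    """z1, z2: (batch, dim) representations from two parallel adapters.
    Penalizes deviation of their cross-correlation matrix from identity,
    keeping the parallel representations decorrelated (neurodiverse)."""
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / z1.size(0)                 # (dim, dim) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + off_diag_weight * off_diag

# Assumed training sketch: total_loss = lm_loss + lam * barlow_twins_penalty(h_a, h_b),
# where h_a, h_b are hidden states routed through two parallel LoRA adapters.
```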
☆ Analyticup E-commerce Product Search Competition Technical Report from Team Tredence_AICOE
This study presents the multilingual e-commerce search system developed by
the Tredence_AICOE team. The competition features two multilingual relevance
tasks: Query-Category (QC) Relevance, which evaluates how well a user's search
query aligns with a product category, and Query-Item (QI) Relevance, which
measures the match between a multilingual search query and an individual
product listing. To ensure full language coverage, we performed data
augmentation by translating existing datasets into languages missing from the
development set, enabling training across all target languages. We fine-tuned
Gemma-3 12B and Qwen-2.5 14B models for both tasks using multiple strategies.
The Gemma-3 12B (4-bit) model achieved the best QC performance using original
and translated data, and the best QI performance using original and translated
data together with newly created minority-class data. These approaches secured 4th place on the
final leaderboard, with an average F1-score of 0.8857 on the private test set.
☆ \textsc{CantoNLU}: A benchmark for Cantonese natural language understanding
Cantonese, although spoken by millions, remains under-resourced due to policy
and diglossia. To address this scarcity of evaluation frameworks for Cantonese,
we introduce \textsc{\textbf{CantoNLU}}, a benchmark for Cantonese natural
language understanding (NLU). This novel benchmark spans seven tasks covering
syntax and semantics, including word sense disambiguation, linguistic
acceptability judgment, language detection, natural language inference,
sentiment analysis, part-of-speech tagging, and dependency parsing. In addition
to the benchmark, we provide model baseline performance across a set of models:
a Mandarin model without Cantonese training, two Cantonese-adapted models
obtained by continual pre-training a Mandarin model on Cantonese text, and a
monolingual Cantonese model trained from scratch. Results show that
Cantonese-adapted models perform best overall, while monolingual models perform
better on syntactic tasks. Mandarin models remain competitive in certain
settings, indicating that direct transfer may be sufficient when Cantonese
domain data is scarce. We release all datasets, code, and model weights to
facilitate future research in Cantonese NLP.
comment: 13 pages, 1 figure
☆ The Reasoning Lingua Franca: A Double-Edged Sword for Multilingual AI
Large Reasoning Models (LRMs) achieve strong performance on mathematical,
scientific, and other question-answering tasks, but their multilingual
reasoning abilities remain underexplored. When presented with non-English
questions, LRMs often default to reasoning in English, raising concerns about
interpretability and the handling of linguistic and cultural nuances. We
systematically compare an LRM's reasoning in English versus the language of the
question. Our evaluation spans two tasks: MGSM and GPQA Diamond. Beyond
measuring answer accuracy, we also analyze cognitive attributes in the
reasoning traces. We find that English reasoning traces exhibit a substantially
higher presence of these cognitive behaviors, and that reasoning in English
generally yields higher final-answer accuracy, with the performance gap
increasing as tasks become more complex. However, this English-centric strategy
is susceptible to a key failure mode: getting "Lost in Translation," where
translation steps introduce errors that would have been avoided by reasoning in
the question's language.
comment: 14 pages, 13 figures, 5 tables
☆ Why Did Apple Fall To The Ground: Evaluating Curiosity In Large Language Model
Curiosity serves as a pivotal conduit for human beings to discover and learn
new knowledge. Recent advancements of large language models (LLMs) in natural
language processing have sparked discussions regarding whether these models
possess a capability for curiosity-driven learning akin to that of humans. In this paper,
starting from the human curiosity assessment questionnaire Five-Dimensional
Curiosity scale Revised (5DCR), we design a comprehensive evaluation framework
that covers dimensions such as Information Seeking, Thrill Seeking, and Social
Curiosity to assess the extent of curiosity exhibited by LLMs. The results
demonstrate that LLMs exhibit a stronger thirst for knowledge than humans but
still tend to make conservative choices when faced with uncertain environments.
We further investigate the relationship between curiosity and thinking in
LLMs, confirming that curious behaviors can enhance the model's reasoning and
active learning abilities. These findings suggest that LLMs have the potential
to exhibit curiosity similar to that of humans, providing experimental support
for the future development of learning capabilities and innovative research in
LLMs.
☆ BUSTED at AraGenEval Shared Task: A Comparative Study of Transformer-Based Models for Arabic AI-Generated Text Detection
This paper details our submission to the AraGenEval Shared Task on Arabic
AI-generated text detection, where our team, BUSTED, secured 5th place. We
investigated the effectiveness of three pre-trained transformer models:
AraELECTRA, CAMeLBERT, and XLM-RoBERTa. Our approach involved fine-tuning each
model on the provided dataset for a binary classification task. Our findings
revealed a surprising result: the multilingual XLM-RoBERTa model achieved the
highest performance with an F1 score of 0.7701, outperforming the specialized
Arabic models. This work underscores the complexities of AI-generated text
detection and highlights the strong generalization capabilities of
multilingual models.
☆ What Defines Good Reasoning in LLMs? Dissecting Reasoning Steps with Multi-Aspect Evaluation
Evaluating large language models (LLMs) on final-answer correctness is the
dominant paradigm. This approach, however, provides a coarse signal for model
improvement and overlooks the quality of the underlying reasoning process. We
argue that a more granular evaluation of reasoning offers a more effective path
to building robust models. We decompose reasoning quality into two dimensions:
relevance and coherence. Relevance measures if a step is grounded in the
problem; coherence measures if it follows logically from prior steps. To
measure these aspects reliably, we introduce causal stepwise evaluation (CaSE).
This method assesses each reasoning step using only its preceding context,
which avoids hindsight bias. We validate CaSE against human judgments on our
new expert-annotated benchmarks, MRa-GSM8K and MRa-MATH. More importantly, we
show that curating training data with CaSE-evaluated relevance and coherence
directly improves final task performance. Our work provides a scalable
framework for analyzing, debugging, and improving LLM reasoning, demonstrating
the practical value of moving beyond validity checks.
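
A minimal sketch of the causal stepwise evaluation loop described above, assuming a hypothetical `judge(problem, prefix, step)` callable (e.g., an LLM grader) that returns relevance and coherence scores; only the preceding context is ever shown to the judge.

```python
def case_evaluate(problem, steps, judge):
    """CaSE sketch: score each reasoning step using only its preceding
    context, never later steps or the final answer, avoiding hindsight bias.
    `judge` is a hypothetical grader returning (relevance, coherence) in [0, 1]."""
    results = []
    for i, step in enumerate(steps):
        prefix = steps[:i]                     # only earlier steps are visible
        relevance, coherence = judge(problem, prefix, step)
        results.append({"step": i, "relevance": relevance, "coherence": coherence})
    return results
```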
☆ Can ChatGPT Code Communication Data Fairly?: Empirical Evidence from Multiple Collaborative Tasks
Assessing communication and collaboration at scale depends on a labor
intensive task of coding communication data into categories according to
different frameworks. Prior research has established that ChatGPT can be
directly instructed with coding rubrics to code the communication data and
achieves accuracy comparable to human raters. However, whether the coding from
ChatGPT or similar AI technology exhibits bias against different demographic
groups, such as gender and race, remains unclear. To fill this gap, this paper
investigates ChatGPT-based automated coding of communication data using a
typical coding framework for collaborative problem solving, examining
differences across gender and racial groups. The analysis draws on data from
three types of collaborative tasks: negotiation, problem solving, and decision
making. Our results show that ChatGPT-based coding exhibits no significant bias
across gender and racial groups, paving the road for its adoption in
large-scale assessment of collaboration and communication.
comment: 38 pages, 4 figures
☆ Beyond Retrieval-Ranking: A Multi-Agent Cognitive Decision Framework for E-Commerce Search
The retrieval-ranking paradigm has long dominated e-commerce search, but its
reliance on query-item matching fundamentally misaligns with multi-stage
cognitive decision processes of platform users. This misalignment introduces
critical limitations: semantic gaps in complex queries, high decision costs due
to cross-platform information foraging, and the absence of professional
shopping guidance. To address these issues, we propose a Multi-Agent Cognitive
Decision Framework (MACDF), which shifts the paradigm from passive retrieval to
proactive decision support. Extensive offline evaluations demonstrate MACDF's
significant improvements in recommendation accuracy and user satisfaction,
particularly for complex queries involving negation, multi-constraint, or
reasoning demands. Online A/B testing on JD search platform confirms its
practical efficacy. This work highlights the transformative potential of
multi-agent cognitive systems in redefining e-commerce search.
☆ GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning
Jinchang Luo, Mingquan Cheng, Fan Wan, Ni Li, Xiaoling Xia, Shuangshuang Tian, Tingcheng Bian, Haiwei Wang, Haohuan Fu, Yan Tao
Reinforcement learning has recently shown promise in improving
retrieval-augmented generation (RAG). Despite these advances, its effectiveness
in multi-hop question answering (QA) remains constrained by two fundamental
limitations: (i) the absence of global planning to structure multi-step reasoning, and
(ii) unfaithful execution, which hinders effective query formulation and
consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement
learning framework designed to enhance global reasoning in multi-hop QA.
GlobalRAG decomposes questions into subgoals, coordinates retrieval with
reasoning, and refines evidence iteratively. To guide this process, we
introduce Planning Quality Reward and SubGoal Completion Reward, which
encourage coherent planning and reliable subgoal execution. In addition, a
progressive weight annealing strategy balances process-oriented and
outcome-based objectives. Extensive experiments on both in-domain and
out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms
strong baselines while using only 8k training examples (42% of the training data
used by strong baselines), achieving average improvements of 14.2% in both EM
and F1.
comment: 8 pages, 3 figures, 4 tables
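
A hedged sketch of the progressive weight annealing idea mentioned above: early in training the process-oriented rewards (planning quality, subgoal completion) dominate, and the outcome-based reward gradually takes over. The linear schedule and the equal split between the two process rewards are assumptions.

```python
def annealed_reward(planning_r, subgoal_r, outcome_r, step, total_steps):
    """Blend process-oriented and outcome-based rewards with a weight that
    decays linearly over training (an assumed schedule, not the paper's)."""
    w_process = max(0.0, 1.0 - step / total_steps)
    process_r = 0.5 * planning_r + 0.5 * subgoal_r
    return w_process * process_r + (1.0 - w_process) * outcome_r
```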
☆ The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts
When language models correctly parse "The cat that the dog chased meowed,"
are they analyzing syntax or simply familiar with dogs chasing cats? Despite
extensive benchmarking, we lack methods to distinguish structural understanding
from semantic pattern matching. We introduce CenterBench, a dataset of 9,720
comprehension questions on center-embedded sentences (like "The cat [that the
dog chased] meowed") where relative clauses nest recursively, creating
processing demands from simple to deeply nested structures. Each sentence has a
syntactically identical but semantically implausible counterpart (e.g., mailmen
prescribe medicine, doctors deliver mail) and six comprehension questions
testing surface understanding, syntactic dependencies, and causal reasoning.
Testing six models reveals that performance gaps between plausible and
implausible sentences widen systematically with complexity, with models showing
median gaps up to 26.8 percentage points, quantifying when they abandon
structural analysis for semantic associations. Notably, semantic plausibility
harms performance on questions about resulting actions, where following causal
relationships matters more than semantic coherence. Reasoning models improve
accuracy but their traces show semantic shortcuts, overthinking, and answer
refusal. Unlike models whose plausibility advantage systematically widens with
complexity, humans show variable semantic effects. CenterBench provides the
first framework to identify when models shift from structural analysis to
pattern matching.
☆ ARC-Encoder: learning compressed text representations for large language models
Recent techniques such as retrieval-augmented generation or chain-of-thought
reasoning have led to longer contexts and increased inference costs. Context
compression techniques can reduce these costs, but the most effective
approaches require fine-tuning the target model or even modifying its
architecture. This can degrade its general abilities when not used for this
specific purpose. Here we explore an alternative approach: an encoder that
compresses the context into continuous representations which replace token
embeddings in decoder LLMs. First, we perform a systematic study of training
strategies and architecture choices for the encoder. Our findings led to the
design of an Adaptable text Representations Compressor, named ARC-Encoder,
which outputs $x$-times fewer continuous representations (typically
$x\!\in\!\{4,8\}$) than text tokens. We evaluate ARC-Encoder across a variety
of LLM usage scenarios, ranging from in-context learning to context window
extension, on both instruct and base decoders. Results show that ARC-Encoder
achieves state-of-the-art performance on several benchmarks while improving
computational efficiency at inference. Finally, we demonstrate that our models
can be adapted to multiple decoders simultaneously, allowing a single encoder
to generalize across different decoder LLMs. This makes ARC-Encoder a flexible
and efficient solution for portable encoders that work seamlessly with multiple
LLMs. We release training code at https://github.com/kyutai-labs/ARC-Encoder ;
the fine-tuning dataset and pretrained models are available at
https://huggingface.co/collections/kyutai/arc-encoders-68ee18787301407d60a57047 .
☆ Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment ICASSP 2026
Recent speech-to-speech (S2S) models generate intelligible speech but still
lack natural expressiveness, largely due to the absence of a reliable
evaluation metric. Existing approaches, such as subjective MOS ratings,
low-level acoustic features, and emotion recognition are costly, limited, or
incomplete. To address this, we present DeEAR (Decoding the Expressive
Preference of eAR), a framework that converts human preference for speech
expressiveness into an objective score. Grounded in phonetics and psychology,
DeEAR evaluates speech across three dimensions: Emotion, Prosody, and
Spontaneity, achieving strong alignment with human perception (Spearman's Rank
Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples.
Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data
curation. It not only distinguishes expressiveness gaps across S2S models but
also selects 14K expressive utterances to form ExpressiveSpeech, which improves
the expressive score (from 2.0 to 23.4 on a 100-point scale) of S2S models.
Demos and codes are available at
https://github.com/FreedomIntelligence/ExpressiveSpeech
comment: Submitted to ICASSP 2026. Demos and codes are available at
https://github.com/FreedomIntelligence/ExpressiveSpeech
☆ Assessing the Political Fairness of Multilingual LLMs: A Case Study based on a 21-way Multiparallel EuroParl Dataset
The political biases of Large Language Models (LLMs) are usually assessed by
simulating their answers to English surveys. In this work, we propose an
alternative framing of political biases, relying on principles of fairness in
multilingual translation. We systematically compare the translation quality of
speeches in the European Parliament (EP), observing systematic differences with
majority parties from left, center, and right being better translated than
outsider parties. This study is made possible by a new, 21-way multiparallel
version of EuroParl, the parliamentary proceedings of the EP, which includes
the political affiliations of each speaker. The dataset consists of 1.5M
sentences for a total of 40M words and 249M characters. It covers three years,
1000+ speakers, 7 countries, 12 EU parties, 25 EU committees, and hundreds of
national parties.
☆ Hierarchical Sequence Iteration for Heterogeneous Question Answering
Retrieval-augmented generation (RAG) remains brittle on multi-step questions
and heterogeneous evidence sources, trading accuracy against latency and
token/tool budgets. This paper introduces Hierarchical Sequence (HSEQ) Iteration
for Heterogeneous Question Answering, a unified framework that (i) linearizes
documents, tables, and knowledge graphs into a reversible hierarchical sequence
with lightweight structural tags, and (ii) performs structure-aware iteration to
collect just-enough evidence before answer synthesis. A Head Agent provides
guidance that steers retrieval, while an Iteration Agent selects and expands
the HSeq via structure-respecting actions (e.g., parent/child hops, table
row/column neighbors, KG relations); finally, the Head Agent composes
canonicalized evidence to generate the final answer, with an optional
refinement loop to resolve detected contradictions. Experiments on HotpotQA
(text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1
gains over strong single-pass, multi-hop, and agentic RAG baselines with high
efficiency. In addition, HSEQ exhibits three key advantages: (1) a format-agnostic
unification that enables a single policy to operate across text, tables, and
KGs without per-dataset specialization; (2) guided, budget-aware iteration that
reduces unnecessary hops, tool calls, and tokens while preserving accuracy; and
(3) evidence canonicalization for reliable QA, improving answer consistency
and auditability.
comment: 22 pages, 3 figures
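
To give a sense of the linearization step, here is a hedged toy sketch of how a table might be flattened into tagged, sentence-like units whose tags keep enough structure (table / row / cell coordinates) to reverse the mapping and to support row/column-neighbor hops during iteration. The tag names are illustrative, not the paper's scheme.

```python
def linearize_table(table_id, header, rows):
    """Flatten a table into tagged units for a hierarchical sequence (HSeq)."""
    units = [f"[TAB {table_id}] [HEAD] " + " | ".join(header)]
    for r, row in enumerate(rows):
        cells = " | ".join(f"[CELL {r}.{c}] {v}" for c, v in enumerate(row))
        units.append(f"[TAB {table_id}] [ROW {r}] {cells}")
    return units

# Example: linearize_table("t1", ["Team", "Wins"], [["Leeds", "12"], ["York", "9"]])
# -> ['[TAB t1] [HEAD] Team | Wins',
#     '[TAB t1] [ROW 0] [CELL 0.0] Leeds | [CELL 0.1] 12',
#     '[TAB t1] [ROW 1] [CELL 1.0] York | [CELL 1.1] 9']
```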
☆ Robust Preference Alignment via Directional Neighborhood Consensus ICLR 2026
Aligning large language models with human preferences is critical for
creating reliable and controllable AI systems. A human preference can be
visualized as a high-dimensional vector where different directions represent
trade-offs between desired attributes (e.g., helpfulness vs. verbosity). Yet,
because the training data often reflects dominant, average preferences, LLMs
tend to perform well on common requests but fall short in specific, individual
needs. This mismatch creates a preference coverage gap. Existing methods often
address this through costly retraining, which may not be generalized to the
full spectrum of diverse preferences. This brittleness means that when a user's
request reflects a nuanced preference deviating from the training data's
central tendency, model performance can degrade unpredictably. To address this
challenge, we introduce Robust Preference Selection (RPS), a post-hoc,
training-free method that leverages directional neighborhood consensus. Instead
of forcing a model to generate a response from a single, highly specific
preference, RPS samples multiple responses from a local neighborhood of related
preferences to create a superior candidate pool. It then selects the response
that best aligns with the user's original intent. We provide a theoretical
framework showing our neighborhood generation strategy is provably superior to
a strong baseline that also samples multiple candidates. Comprehensive
experiments across three distinct alignment paradigms (DPA, DPO, and SFT)
demonstrate that RPS consistently improves robustness against this baseline,
achieving win rates of up to 69% on challenging preferences from
under-represented regions of the space without any model retraining. Our work
presents a practical, theoretically-grounded solution for enhancing the
reliability of preference-aligned models.
comment: Under review at ICLR 2026. 10 pages, 5 figures. Code and data
available at https://github.com/rcmao/robust-preference-alignment
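
A hedged sketch of the selection procedure: responses are sampled from a small neighborhood of perturbed preference directions and the one best aligned with the user's original preference is returned. `generate(prompt, v)` and `align_score(response, pref)` are hypothetical interfaces standing in for preference-conditioned decoding and an alignment scorer.

```python
import numpy as np

def robust_preference_selection(prompt, pref, generate, align_score,
                                k_neighbors=8, noise=0.1, seed=0):
    """RPS sketch: decode from a neighborhood of related preference vectors,
    then select the response best aligned with the original preference."""
    rng = np.random.default_rng(seed)
    directions = [pref] + [pref + noise * rng.standard_normal(pref.shape)
                           for _ in range(k_neighbors)]
    directions = [v / np.linalg.norm(v) for v in directions]   # stay on the sphere
    candidates = [generate(prompt, v) for v in directions]
    return max(candidates, key=lambda r: align_score(r, pref))
```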
☆ Steering Evaluation-Aware Language Models To Act Like They Are Deployed
Large language models (LLMs) can sometimes detect when they are being
evaluated and adjust their behavior to appear more aligned, compromising the
reliability of safety evaluations. In this paper, we show that adding a
steering vector to an LLM's activations can suppress evaluation-awareness and
make the model act like it is deployed during evaluation. To study our steering
technique, we train an LLM to exhibit evaluation-aware behavior using a
two-step training process designed to mimic how this behavior could emerge
naturally. First, we perform continued pretraining on documents with factual
descriptions of the model (1) using Python type hints during evaluation but not
during deployment and (2) recognizing that the presence of a certain evaluation
cue always means that it is being tested. Then, we train the model with expert
iteration to use Python type hints in evaluation settings. The resulting model
is evaluation-aware: it writes type hints in evaluation contexts more than
deployment contexts. However, this gap can only be observed by removing the
evaluation cue. We find that activation steering can suppress evaluation
awareness and make the model act like it is deployed even when the cue is
present. Importantly, we constructed our steering vector using the original
model before our additional training. Our results suggest that AI evaluators
could improve the reliability of safety evaluations by steering models to act
like they are deployed.
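
A hedged PyTorch sketch of the activation-steering mechanism described above: a forward hook adds a fixed steering vector (computed from the original model) to a chosen layer's output activations. The layer path in the usage note and the scale are illustrative assumptions.

```python
import torch

def add_steering_hook(model, layer_module, steering_vector, scale=4.0):
    """Register a forward hook that adds `scale * steering_vector` to the
    chosen layer's output activations, pushing the model toward the
    'deployment' direction. Layer choice and scale are illustrative."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)

# Usage sketch (assumed Llama-style module path):
# handle = add_steering_hook(model, model.model.layers[20], v_deploy)
# ... run the evaluation ...
# handle.remove()
```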
☆ RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging
Bowen Wang, Haiyuan Wan, Liwen Shi, Chen Yang, Peng He, Yue Ma, Haochen Han, Wenhao Li, Tiao Tan, Yongjian Li, Fangming Liu, Yifan Gong, Sheng Zhang
We unveil that internal representations in large language models (LLMs) serve
as reliable proxies of learned knowledge, and propose RECALL, a novel
representation-aware model merging framework for continual learning without
access to historical data. RECALL computes inter-model similarity from
layer-wise hidden representations over clustered typical samples, and performs
adaptive, hierarchical parameter fusion to align knowledge across models. This
design enables the preservation of domain-general features in shallow layers
while allowing task-specific adaptation in deeper layers. Unlike prior methods
that require task labels or incur performance trade-offs, RECALL achieves
seamless multi-domain integration and strong resistance to catastrophic
forgetting. Extensive experiments across five NLP tasks and multiple continual
learning scenarios show that RECALL outperforms baselines in both knowledge
retention and generalization, providing a scalable and data-free solution for
evolving LLMs.
☆ Mask and You Shall Receive: Optimizing Masked Language Modeling For Pretraining BabyLMs
We describe our strategy for the 2025 edition of the BabyLM Challenge. Our
main contribution is that of an improved form of Masked Language Modeling
(MLM), which adapts the probabilities of the tokens masked according to the
model's ability to predict them. The results show a substantial increase in
performance on (Super)GLUE tasks over the standard MLM. We also incorporate
sub-token embeddings, finding that this increases the model's morphological
generalization capabilities. Our submission beats the baseline in the
strict-small track.
comment: Submission to the 2025 BabyLM Challenge
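
A hedged sketch of the adaptive masking idea: positions the model currently predicts poorly (higher per-token loss from a previous pass) are masked more often, while the expected masking rate stays near the usual 15%. The exact weighting is an assumption, not the submission's recipe.

```python
import torch

def adaptive_mask(token_losses: torch.Tensor, base_rate: float = 0.15) -> torch.Tensor:
    """token_losses: (seq_len,) per-token losses from a prior forward pass.
    Returns a boolean mask where True marks positions to mask for MLM."""
    probs = base_rate * token_losses / (token_losses.mean() + 1e-8)
    probs = probs.clamp(max=0.5)                 # avoid masking whole spans
    return torch.bernoulli(probs).bool()
```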
☆ Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models
Large language models (LLMs) produce outputs with varying levels of
uncertainty and, just as often, varying levels of correctness, making their
practical reliability far from guaranteed. To quantify this uncertainty, we
systematically evaluate four approaches for confidence estimation in LLM
outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). For
the evaluation of the approaches, we conduct experiments on four
question-answering tasks using a state-of-the-art open-source LLM. Our results
show that each uncertainty metric captures a different facet of model
confidence and that the hybrid CoCoA approach yields the best reliability
overall, improving both calibration and discrimination of correct answers. We
discuss the trade-offs of each method and provide recommendations for selecting
uncertainty measures in LLM applications.
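
For concreteness, here is a hedged sketch of two of the evaluated signals in one common instantiation: an MSP-style score as the mean token probability of the generated answer, and sample consistency as the agreement rate among resampled answers. Definitions vary across papers; this is one reasonable variant, not the exact setup used.

```python
import math
from collections import Counter

def msp_confidence(token_logprobs):
    """One common MSP-style score: average token probability of the answer."""
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def sample_consistency(sampled_answers, normalize=str.strip):
    """Fraction of resampled answers agreeing with the most frequent answer."""
    counts = Counter(normalize(a) for a in sampled_answers)
    return counts.most_common(1)[0][1] / len(sampled_answers)
```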
☆ LM-mixup: Text Data Augmentation via Language Model based Mixup
Instruction tuning is crucial for aligning Large Language Models (LLMs), yet
the quality of instruction-following data varies significantly. While
high-quality data is paramount, it is often scarce; conversely, abundant
low-quality data is frequently discarded, leading to substantial information
loss. Existing data augmentation methods struggle to augment this low-quality
data effectively, and the evaluation of such techniques remains poorly defined.
To address this, we formally define the task of Instruction Distillation:
distilling multiple low-quality and redundant inputs into high-quality and
coherent instruction-output pairs. Specifically, we introduce a comprehensive
data construction pipeline to create MIXTURE, a 144K-sample dataset pairing
low-quality or semantically redundant imperfect instruction clusters with their
high-quality distillations. We then introduce LM-Mixup, trained by first performing
supervised fine-tuning on MIXTURE and then optimizing it with reinforcement
learning. This process uses three complementary reward signals: quality,
semantic alignment, and format compliance, via Group Relative Policy
Optimization (GRPO). We demonstrate that LM-Mixup effectively augments
imperfect datasets: fine-tuning LLMs on its distilled data, which accounts for
only about 3% of the entire dataset, not only surpasses full-dataset training
but also competes with state-of-the-art high-quality data selection methods
across multiple benchmarks. Our work establishes that low-quality data is a
valuable resource when properly distilled and augmented with LM-Mixup,
significantly enhancing the efficiency and performance of instruction-tuned
LLMs.
☆ Teacher Demonstrations in a BabyLM's Zone of Proximal Development for Contingent Multi-Turn Interaction EMNLP 2025
Suchir Salhan, Hongyi Gu, Donya Rooein, Diana Galvan-Sosa, Gabrielle Gaudeau, Andrew Caines, Zheng Yuan, Paula Buttery
Multi-turn dialogues between a child and a caregiver are characterized by a
property called contingency - that is, prompt, direct, and meaningful exchanges
between interlocutors. We introduce ContingentChat, a teacher-student framework
that benchmarks and improves multi-turn contingency in a BabyLM trained on 100M
words. Using a novel alignment dataset for post-training, BabyLM generates
responses that are more grammatical and cohesive. Experiments with adaptive
teacher decoding strategies show limited additional gains. ContingentChat
demonstrates the benefits of targeted post-training for dialogue quality and
indicates that contingency remains a challenging goal for BabyLMs.
comment: Outstanding Paper Award, EMNLP 2025 BabyLM Workshop - Oral
presentation, Suzhou, China
☆ Relative-Based Scaling Law for Neural Language Models
Scaling laws aim to accurately predict model performance across different
scales. Existing scaling-law studies almost exclusively rely on cross-entropy
as the evaluation metric. However, cross-entropy provides only a partial view
of performance: it measures the absolute probability assigned to the correct
token, but ignores the relative ordering between correct and incorrect tokens.
Yet, relative ordering is crucial for language models, for example in the
greedy-sampling scenario. To address this limitation, we investigate scaling
from the perspective of relative ordering. We first propose the Relative-Based
Probability (RBP) metric, which quantifies the probability that the correct
token is ranked among the top predictions. Building on this metric, we
establish the Relative-Based Scaling Law, which characterizes how RBP improves
with increasing model size. Through extensive experiments on four datasets and
four model families spanning five orders of magnitude, we demonstrate the
robustness and accuracy of this law. Finally, we illustrate the broad
application of this law with two examples, namely providing a deeper
explanation of emergence phenomena and facilitating finding fundamental
theories of scaling laws. In summary, the Relative-Based Scaling Law
complements the cross-entropy perspective and contributes to a more complete
understanding of scaling large language models. Thus, it offers valuable
insights for both practical development and theoretical exploration.
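
A hedged sketch of how an RBP-style quantity could be computed from a batch of next-token distributions: the fraction of positions where the correct token is ranked within the model's top-k predictions. The exact definition in the paper may differ; k is a parameter of the metric.

```python
import numpy as np

def relative_based_probability(logits: np.ndarray, targets: np.ndarray, k: int = 1) -> float:
    """logits: (num_positions, vocab) next-token scores; targets: (num_positions,)
    correct token ids. Returns the fraction of positions where the correct
    token appears among the top-k predictions."""
    topk = np.argpartition(-logits, k - 1, axis=-1)[:, :k]   # unordered top-k ids
    hits = (topk == targets[:, None]).any(axis=-1)
    return float(hits.mean())
```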
☆ NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew
Since their initial release, BERT models have demonstrated exceptional
performance on a variety of tasks, despite their relatively small size
(BERT-base has ~100M parameters). Nevertheless, the architectural choices used
in these models are outdated compared to newer transformer-based models such as
Llama3 and Qwen3. In recent months, several architectures have been proposed to
close this gap. ModernBERT and NeoBERT both show strong improvements on English
benchmarks and significantly extend the supported context window. Following
their successes, we introduce NeoDictaBERT and NeoDictaBERT-bilingual:
BERT-style models trained using the same architecture as NeoBERT, with a
dedicated focus on Hebrew texts. These models outperform existing ones on
almost all Hebrew benchmarks and provide a strong foundation for downstream
tasks. Notably, the NeoDictaBERT-bilingual model shows strong results on
retrieval tasks, outperforming other multilingual models of similar size. In
this paper, we describe the training process and report results across various
benchmarks. We release the models to the community as part of our goal to
advance research and development in Hebrew NLP.
☆ VLSP 2025 MLQA-TSR Challenge: Vietnamese Multimodal Legal Question Answering on Traffic Sign Regulation SP 2025
Son T. Luu, Trung Vo, Hiep Nguyen, Khanh Quoc Tran, Kiet Van Nguyen, Vu Tran, Ngan Luu-Thuy Nguyen, Le-Minh Nguyen
This paper presents VLSP 2025 MLQA-TSR, the shared task on multimodal legal
question answering on traffic sign regulation at VLSP 2025. VLSP 2025
MLQA-TSR comprises two subtasks: multimodal legal retrieval and multimodal
question answering. The goal is to advance research on Vietnamese multimodal
legal text processing and to provide a benchmark dataset for building and
evaluating intelligent systems in multimodal legal domains, with a focus on
traffic sign regulation in Vietnam. The best-reported results on VLSP 2025
MLQA-TSR are an F2 score of 64.55% for multimodal legal retrieval and an
accuracy of 86.30% for multimodal question answering.
comment: VLSP 2025 MLQA-TSR Shared Task
☆ IKnow: Instruction-Knowledge-Aware Continual Pretraining for Effective Domain Adaptation
Continual pretraining promises to adapt large language models (LLMs) to new
domains using only unlabeled test-time data, but naively applying standard
self-supervised objectives to instruction-tuned models is known to degrade
their instruction-following capability and semantic representations. Existing
fixes assume access to the original base model or rely on knowledge from an
external domain-specific database - both of which pose a realistic barrier in
settings where the base model weights are withheld for safety reasons or
reliable external corpora are unavailable. In this work, we propose
Instruction-Knowledge-Aware Continual Adaptation (IKnow), a simple and general
framework that formulates novel self-supervised objectives in the
instruction-response dialogue format. Rather than depending on external
resources, IKnow leverages domain knowledge embedded within the text itself and
learns to encode it at a deeper semantic level.
☆ The Impact of Negated Text on Hallucination with Large Language Models EMNLP 2025
Recent studies on hallucination in large language models (LLMs) have been
actively progressing in natural language processing. However, the impact of
negated text on hallucination with LLMs remains largely unexplored. In this
paper, we set three important yet unanswered research questions and aim to
address them. To derive the answers, we investigate whether LLMs can recognize
contextual shifts caused by negation and still reliably distinguish
hallucinations comparable to affirmative cases. We also design the NegHalu
dataset by reconstructing existing hallucination detection datasets with
negated expressions. Our experiments demonstrate that LLMs struggle to detect
hallucinations in negated text effectively, often producing logically
inconsistent or unfaithful judgments. Moreover, we trace the internal state of
LLMs as they process negated inputs at the token level and reveal the
challenges of mitigating their unintended effects.
comment: Accepted to EMNLP 2025
☆ Dialogue Is Not Enough to Make a Communicative BabyLM (But Neither Is Developmentally Inspired Reinforcement Learning)
Francesca Padovani, Bastian Bunzeck, Manar Ali, Omar Momen, Arianna Bisazza, Hendrik Buschmeier, Sina Zarrieß
We investigate whether pre-training exclusively on dialogue data results in
formally and functionally apt small language models. Based on this pre-trained
llamalogue model, we employ a variety of fine-tuning strategies to enforce
"more communicative" text generations by our models. Although our models
underperform on most standard BabyLM benchmarks, they excel at dialogue
continuation prediction in a minimal pair setting. While PPO fine-tuning has
mixed to adversarial effects on our models, DPO fine-tuning further improves
their performance on our custom dialogue benchmark.
☆ FreeChunker: A Cross-Granularity Chunking Framework
Chunking strategies significantly impact the effectiveness of
Retrieval-Augmented Generation (RAG) systems. Existing methods operate within
fixed-granularity paradigms that rely on static boundary identification,
limiting their adaptability to diverse query requirements. This paper presents
FreeChunker, a Cross-Granularity Encoding Framework that fundamentally
transforms the traditional chunking paradigm: the framework treats sentences as
atomic units and shifts from static chunk segmentation to flexible retrieval
supporting arbitrary sentence combinations. This paradigm shift not only
significantly reduces the computational overhead required for semantic boundary
detection but also enhances adaptability to complex queries. Experimental
evaluation on LongBench V2 demonstrates that FreeChunker achieves superior
retrieval performance compared to traditional chunking methods, while
significantly outperforming existing approaches in computational efficiency.
comment: Submitted to arXiv, October 2025
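A minimal Python sketch of sentence-level, query-time chunk assembly in the spirit described above; the lexical scorer and function names are illustrative stand-ins for the embedding-based retrieval, not the paper's implementation.

    import re

    def sentence_pool(doc: str):
        """Split a document into atomic sentence units (naive splitter)."""
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]

    def retrieve_chunk(query: str, sentences, k: int = 3):
        """Score every sentence against the query and assemble an ad-hoc chunk
        from the top-k sentences, kept in document order. A toy lexical scorer
        stands in for an embedding model here."""
        q = set(re.findall(r"\w+", query.lower()))
        def score(s):
            toks = set(re.findall(r"\w+", s.lower()))
            return len(q & toks) / (len(toks) ** 0.5 + 1e-9)
        ranked = sorted(range(len(sentences)),
                        key=lambda i: score(sentences[i]), reverse=True)[:k]
        return " ".join(sentences[i] for i in sorted(ranked))

    doc = ("RAG systems retrieve passages before generation. Chunk boundaries are "
           "usually fixed. Sentence-level retrieval instead selects sentences on "
           "demand. The selected sentences are concatenated into a query-specific chunk.")
    print(retrieve_chunk("how are chunks selected for a query?", sentence_pool(doc)))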
☆ Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models
Large Language Models (LLMs) are increasingly evaluated on their ability to
reason over structured data, yet such assessments often overlook a crucial
confound: dataset contamination. In this work, we investigate whether LLMs
exhibit prior knowledge of widely used tabular benchmarks such as Adult Income,
Titanic, and others. Through a series of controlled probing experiments, we
reveal that contamination effects emerge exclusively for datasets containing
strong semantic cues, for instance meaningful column names or interpretable
value categories. In contrast, when such cues are removed or randomized,
performance sharply declines to near-random levels. These findings suggest that
LLMs' apparent competence on tabular reasoning tasks may, in part, reflect
memorization of publicly available datasets rather than genuine generalization.
We discuss implications for evaluation protocols and propose strategies to
disentangle semantic leakage from authentic reasoning ability in future LLM
assessments.
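A small Python sketch of the kind of probe described above, contrasting prompts that expose semantic column names against anonymized ones; field names and prompt formatting are illustrative, not the paper's protocol.

    def probe_prompt(rows, columns, target, anonymize=False):
        """Build a probing prompt from tabular rows. With anonymize=True,
        column names are replaced by neutral identifiers, removing the
        semantic cues a contaminated model could exploit."""
        mapping = ({c: f"col_{i}" for i, c in enumerate(columns)}
                   if anonymize else {c: c for c in columns})
        lines = []
        for row in rows:
            feats = ", ".join(f"{mapping[c]}={row[c]}" for c in columns if c != target)
            lines.append(f"{feats} -> {mapping[target]}: ?")
        return "\n".join(lines)

    row = {"age": 39, "education": "Bachelors", "hours_per_week": 40, "income": ">50K"}
    cols = ["age", "education", "hours_per_week", "income"]
    print(probe_prompt([row], cols, target="income"))
    print(probe_prompt([row], cols, target="income", anonymize=True))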
☆ Teaching Language Models to Reason with Tools NIPS2025
Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu
Large reasoning models (LRMs) like OpenAI-o1 have shown impressive
capabilities in natural language reasoning. However, these models frequently
demonstrate inefficiencies or inaccuracies when tackling complex mathematical
operations. While integrating computational tools such as Code Interpreters
(CIs) offers a promising solution, it introduces a critical challenge: a
conflict between the model's internal, probabilistic reasoning and the
external, deterministic knowledge provided by the CI, which often leads models
to unproductive deliberation. To overcome this, we introduce CoRT
(Code-Optimized Reasoning Training), a post-training framework designed to
teach LRMs to effectively utilize CIs. We propose \emph{Hint-Engineering}, a
new data synthesis strategy that strategically injects diverse hints at optimal
points within reasoning paths. This approach generates high-quality,
code-integrated reasoning data specifically tailored to optimize LRM-CI
interaction. Using this method, we have synthesized 30 high-quality samples to
post-train models ranging from 1.5B to 32B parameters through supervised
fine-tuning. CoRT further refines the multi-round interleaving of external CI
usage and internal thinking by employing rejection sampling and reinforcement
learning. Our experimental evaluations demonstrate CoRT's effectiveness,
yielding absolute improvements of 4\% and 8\% on DeepSeek-R1-Distill-Qwen-32B
and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging
mathematical reasoning datasets. Moreover, CoRT significantly enhances
efficiency, reducing token usage by approximately 30\% for the 32B model and
50\% for the 1.5B model compared to pure natural language reasoning baselines.
The models and code are available at: https://github.com/ChengpengLi1003/CoRT.
comment: NIPS2025 Accepted
☆ Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering
Process reward models (PRMs) improve complex reasoning in large language
models (LLMs) by grading candidate solutions step-by-step and selecting answers
via aggregated step scores. While effective in domains such as mathematics,
their applicability to tasks involving semi-structured data, such as table
question answering (TQA), remains unexplored. TQA poses unique challenges for
PRMs, including abundant irrelevant information, loosely connected reasoning
steps, and domain-specific reasoning. This work presents the first systematic
study of PRMs for TQA. We evaluate state-of-the-art generative PRMs on TQA from
both answer and step perspectives. Results show that PRMs that combine textual
and code verification can aid solution selection but struggle to generalize to
out-of-domain data. Analysis reveals a weak correlation between performance in
step-level verification and answer accuracy, possibly stemming from weak step
dependencies and loose causal links. Our findings highlight limitations of
current PRMs on TQA and offer valuable insights for building more robust,
process-aware verifiers.
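To make the answer-selection step concrete, here is a minimal Python sketch of aggregating per-step PRM scores over candidate solutions; the aggregation choices and the scores themselves are illustrative, not taken from the paper.

    import math

    def select_answer(candidates, aggregate="min"):
        """Pick the candidate solution whose step scores aggregate best.
        Each candidate is (answer, [step_score_0, step_score_1, ...]) with
        scores in [0, 1] from a process reward model (made-up here)."""
        def agg(scores):
            if aggregate == "min":            # weakest step dominates
                return min(scores)
            if aggregate == "prod":           # product of step probabilities
                return math.prod(scores)
            return sum(scores) / len(scores)  # mean
        return max(candidates, key=lambda c: agg(c[1]))[0]

    candidates = [
        ("Revenue grew 12%", [0.9, 0.8, 0.7]),
        ("Revenue grew 18%", [0.95, 0.4, 0.9]),
    ]
    print(select_answer(candidates, aggregate="min"))  # -> "Revenue grew 12%"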
☆ Citation Failure: Definition, Analysis and Efficient Mitigation
Citations from LLM-based RAG systems are supposed to simplify response
verification. However, this does not hold in cases of citation failure, where a
model generates a helpful response but fails to cite complete evidence. In
contrast to previous work, we propose to disentangle citation failure from
response failure, where
the response itself is flawed, and citing complete evidence is impossible. To
address citation failure, this work follows a two-step approach: (1) We study
when citation failure occurs and (2) how it can be mitigated. For step 1, we
extend prior work by investigating how the relation between response and
evidence affects citation quality. We introduce CITECONTROL, a benchmark that
systematically varies this relation to analyze failure modes. Experiments show
that failures increase with relational complexity and suggest that combining
citation methods could improve performance, motivating step 2. To improve LLM
citation efficiently, we propose CITENTION, a framework integrating generative,
attention-based, and retrieval-based methods. Results demonstrate substantial
citation improvements on CITECONTROL and in transfer settings. We make our data
and code publicly available.
comment: Under review. Paper repository:
https://github.com/UKPLab/arxiv2025-citation-failure
☆ Context-level Language Modeling by Learning Predictive Context Embeddings
Next-token prediction (NTP) is the cornerstone of modern large language
models (LLMs) pretraining, driving their unprecedented capabilities in text
generation, reasoning, and instruction following. However, the token-level
prediction limits the model's capacity to capture higher-level semantic
structures and long-range contextual relationships. To overcome this
limitation, we introduce \textbf{ContextLM}, a framework that augments standard
pretraining with an inherent \textbf{next-context prediction} objective. This
mechanism trains the model to learn predictive representations of multi-token
contexts, leveraging error signals derived from future token chunks. Crucially,
ContextLM achieves this enhancement while remaining fully compatible with the
standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity).
Extensive experiments on the GPT2 and Pythia model families, scaled up to
$1.5$B parameters, show that ContextLM delivers consistent improvements in both
perplexity and downstream task performance. Our analysis indicates that
next-context prediction provides a scalable and efficient pathway to stronger
language modeling, yielding better long-range coherence and more effective
attention allocation with minimal computational overhead.
comment: 16 pages, 6 figures
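A rough PyTorch sketch of an auxiliary next-context prediction loss in the spirit of the abstract, assuming mean-pooled chunk summaries and an MSE objective; the paper's actual formulation may differ.

    import torch
    import torch.nn.functional as F

    def next_context_loss(hidden, pred_head, chunk_size=8):
        # Summarize each chunk of hidden states by mean pooling, then train
        # `pred_head` to predict the *next* chunk's summary from the current one.
        #   hidden:    [batch, seq_len, d] final-layer hidden states from the LM
        #   pred_head: a small trainable module, e.g. torch.nn.Linear(d, d)
        b, t, d = hidden.shape
        n_chunks = t // chunk_size
        chunks = hidden[:, : n_chunks * chunk_size].reshape(b, n_chunks, chunk_size, d)
        summaries = chunks.mean(dim=2)              # [b, n_chunks, d]
        preds = pred_head(summaries[:, :-1])        # predict chunk k+1 from chunk k
        targets = summaries[:, 1:].detach()         # stop-gradient on the target side
        return F.mse_loss(preds, targets)

    # Usage: total_loss = ntp_loss + lambda_ctx * next_context_loss(h, head)
    h, head = torch.randn(2, 64, 32), torch.nn.Linear(32, 32)
    print(next_context_loss(h, head).item())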
☆ ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases
The tendency to find and exploit "shortcuts" to complete tasks poses
significant risks for reliable assessment and deployment of large language
models (LLMs). For example, an LLM agent with access to unit tests may delete
failing tests rather than fix the underlying bug. Such behavior undermines both
the validity of benchmark results and the reliability of real-world LLM coding
assistant deployments.
To quantify, study, and mitigate such behavior, we introduce ImpossibleBench,
a benchmark framework that systematically measures LLM agents' propensity to
exploit test cases. ImpossibleBench creates "impossible" variants of tasks from
existing benchmarks like LiveCodeBench and SWE-bench by introducing direct
conflicts between the natural-language specification and the unit tests. We
measure an agent's "cheating rate" as its pass rate on these impossible tasks,
where any pass necessarily implies a specification-violating shortcut.
As a practical framework, ImpossibleBench is not just an evaluation but a
versatile tool. We demonstrate its utility for: (1) studying model behaviors,
revealing more fine-grained details of cheating behaviors from simple test
modification to complex operator overloading; (2) context engineering, showing
how prompting, test access, and feedback loops affect cheating rates; and (3)
developing monitoring tools, providing a testbed with verified deceptive
solutions. We hope ImpossibleBench serves as a useful framework for building
more robust and reliable LLM systems.
Our implementation can be found at
https://github.com/safety-research/impossiblebench.
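A crude Python illustration of the two core quantities, an "impossible" test variant and the resulting cheating rate; the string-level mutation below is a toy stand-in for the benchmark's verified task construction.

    import re

    def make_impossible(test_source: str) -> str:
        """Flip expected integer values in assert lines so the tests contradict
        the natural-language spec (a crude illustration; the benchmark applies
        more careful, verified mutations)."""
        def bump(match):
            return f"== {int(match.group(1)) + 1}"
        return re.sub(r"==\s*(\d+)", bump, test_source)

    def cheating_rate(passed_impossible: int, total_impossible: int) -> float:
        """Any pass on an impossible task implies a spec-violating shortcut."""
        return passed_impossible / total_impossible

    original = "def test_add():\n    assert add(2, 2) == 4\n"
    print(make_impossible(original))   # spec says add(2, 2) -> 4, test now demands 5
    print(cheating_rate(3, 100))       # 3% of impossible tasks "passed"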
☆ Calibrating Multimodal Consensus for Emotion Recognition
In recent years, Multimodal Emotion Recognition (MER) has made substantial
progress. Nevertheless, most existing approaches neglect the semantic
inconsistencies that may arise across modalities, such as conflicting emotional
cues between text and visual inputs. In addition, current methods are often
dominated by the text modality due to its strong representational capacity,
which can compromise recognition accuracy. To address these challenges, we
propose a model termed Calibrated Multimodal Consensus (CMC). CMC introduces a
Pseudo Label Generation Module (PLGM) to produce pseudo unimodal labels,
enabling unimodal pretraining in a self-supervised fashion. It then employs a
Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for
multimodal finetuning, thereby mitigating text dominance and guiding the fusion
process toward a more reliable consensus. Experimental results demonstrate that
CMC achieves performance on par with or superior to state-of-the-art methods
across four datasets, CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI, and
exhibits notable advantages in scenarios with semantic inconsistencies on
CH-SIMS and CH-SIMS v2. The implementation of this work is publicly accessible
at https://github.com/gw-zhong/CMC.
☆ Tri-Modal Severity Fused Diagnosis across Depression and Post-traumatic Stress Disorders
Depression and post-traumatic stress disorder (PTSD) often co-occur with
interconnected symptoms, complicating automated assessment, which is typically
binary and disorder-specific. Clinically useful diagnosis needs severity-aware
cross-disorder estimates and decision-support explanations. Our unified
tri-modal affective severity framework synchronizes and fuses interview text
(sentence-level transformer embeddings), audio (log-Mel statistics with
deltas), and facial signals (action units, gaze, head and pose descriptors) to
output graded severities for diagnosing both depression (PHQ-8; 5 classes) and
PTSD (3 classes). Standardized features are fused via a calibrated late-fusion
classifier, yielding per-disorder probabilities and feature-level attributions.
This severity-aware tri-modal affective fusion approach is demonstrated on
concurrent multi-disorder depression and PTSD assessment. Under stratified
cross-validation on DAIC-derived corpora, it outperforms unimodal and ablation
baselines. The fused model matches the strongest unimodal baseline on accuracy
and weighted F1 while improving decision-curve utility and robustness under
noisy or missing modalities. For PTSD specifically, fusion reduces regression
error and improves class concordance. Errors cluster between adjacent
severities, and extreme classes are identified reliably. Ablations show that
text contributes most to depression severity while audio and facial cues are
critical for PTSD, and attributions align with linguistic and behavioral
markers. Our approach offers reproducible evaluation and clinician-in-the-loop
support for affective clinical decision making.
☆ Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context
Large Vision-Language Models (LVLMs) have made significant progress in recent
years but are also prone to hallucination issues. They exhibit more
hallucinations in longer, free-form responses, often attributed to accumulated
uncertainties. In this paper, we ask: Does increased hallucination result
solely from length-induced errors, or is there a deeper underlying mechanism?
After a series of preliminary experiments and findings, we suggest that the
risk of hallucinations is not caused by length itself but by the increased
reliance on context for coherence and completeness in longer responses.
Building on these insights, we propose a novel "induce-detect-suppress"
framework that actively induces hallucinations through deliberately designed
contexts, leverages induced instances for early detection of high-risk cases,
and ultimately suppresses potential object-level hallucinations during actual
decoding. Our approach achieves consistent, significant improvements across all
benchmarks, demonstrating its efficacy. The strong detection and improved
hallucination mitigation not only validate our framework but, more importantly,
re-validate our hypothesis on context. Rather than solely pursuing performance
gains, this study aims to provide new insights and serves as a first step
toward a deeper exploration of hallucinations in LVLMs' longer responses.
☆ Decoding-Free Sampling Strategies for LLM Marginalization
Modern language models operate on subword-tokenized text in order to make a
trade-off between model size, inference speed, and vocabulary coverage. A side
effect of this is that, during inference, models are evaluated by measuring the
probability of only the specific tokenization produced as the output, despite
there being many possible ways to represent the same text with a subword
vocabulary. Recent studies have argued instead for evaluating LLMs by
marginalization, i.e., the probability mass of all tokenizations of a given text.
Marginalization is difficult due to the number of possible tokenizations of a
text, so often approximate marginalization is done via sampling. However, a
downside of sampling is that an expensive generation step must be performed by
the LLM for each sample, which limits the number of samples that can be
acquired given a runtime budget, and therefore also the accuracy of the
approximation. Since computing the probability of a sequence given the
tokenization is relatively cheap compared to actually generating it, we
investigate sampling strategies that are decoding-free - they require no
generation from the LLM, instead relying entirely on extremely cheap sampling
strategies that are model and tokenizer agnostic.
We investigate the approximation quality and speed of decoding-free sampling
strategies for a number of open models, finding that they provide sufficiently
accurate marginal estimates at a small fraction of the runtime cost, and we
demonstrate their use on a set of downstream inference tasks.
comment: 10 pages, 3 figures
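A toy Python sketch of decoding-free marginalization: tokenizations are sampled from a cheap model-free proposal and combined by importance sampling, with score_fn standing in for the LM log-probability of a given tokenization; the proposal, the vocabulary, and the fake scorer are illustrative only.

    import math
    import random

    def sample_tokenization(text, vocab, rng):
        """Cheap, model-free proposal: greedily split the text into random-length
        pieces that happen to be in the vocabulary (illustrative only)."""
        toks, i, logq = [], 0, 0.0
        while i < len(text):
            lengths = [L for L in range(min(4, len(text) - i), 0, -1)
                       if text[i:i + L] in vocab]
            L = rng.choice(lengths)
            logq += math.log(1.0 / len(lengths))
            toks.append(text[i:i + L])
            i += L
        return toks, logq

    def marginal_logprob(text, vocab, score_fn, n_samples=32, seed=0):
        """Importance-sampling estimate of log p(text) marginalized over
        tokenizations: p(text) = E_{t~q}[ p(t) / q(t) ].  `score_fn(tokens)`
        is assumed to return the LM log-probability of that tokenization."""
        rng = random.Random(seed)
        log_ws = []
        for _ in range(n_samples):
            toks, logq = sample_tokenization(text, vocab, rng)
            log_ws.append(score_fn(toks) - logq)
        m = max(log_ws)
        return m + math.log(sum(math.exp(w - m) for w in log_ws) / n_samples)

    vocab = set("the cat") | {"th", "the", "ca", "at", "cat", " c"}
    fake_score = lambda toks: -1.5 * len(toks)   # stand-in for an LM scorer
    print(marginal_logprob("the cat", vocab, fake_score))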
☆ Stuck in the Matrix: Probing Spatial Reasoning in Large Language Models
This paper explores the spatial reasoning capability of large language models
(LLMs) over textual input through a suite of five tasks aimed at probing their
spatial understanding and computational abilities. The models were tested on
both fundamental spatial reasoning and multi-step problem-solving within
structured grid-based environments using tasks such as quadrant identification,
geometric transformations, distance evaluation, word searches, and tile
sliding. Each task was scaled in complexity through increasing grid dimensions,
requiring models to extend beyond simple pattern recognition into abstract
spatial reasoning. Our results reveal that while LLMs demonstrate moderate
success on all tasks at small complexity and size, performance drops off
rapidly as scale increases, with an average accuracy loss of 42.7% and losses
reaching as high as 84%. Every task that began with over 50% accuracy showed a
loss of at least 48%, illustrating the consistent nature of the deterioration.
Furthermore, their struggles with scaling complexity hint at a lack of robust
spatial representations in their underlying architectures. This paper
underscores the gap between linguistic and spatial reasoning in LLMs, offering
insights into their current limitations, and laying the groundwork for future
integrative benchmarks at the intersection of language and geometry.
comment: 20 pages, 24 figures
☆ Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
Question Answering (QA) systems have traditionally relied on structured text
data, but the rapid growth of multimedia content (images, audio, video, and
structured metadata) has introduced new challenges and opportunities for
retrieval-augmented QA. In this survey, we review recent advancements in QA
systems that integrate multimedia retrieval pipelines, focusing on
architectures that align vision, language, and audio modalities with user
queries. We categorize approaches based on retrieval methods, fusion
techniques, and answer generation strategies, and analyze benchmark datasets,
evaluation protocols, and performance tradeoffs. Furthermore, we highlight key
challenges such as cross-modal alignment, latency-accuracy tradeoffs, and
semantic grounding, and outline open problems and future research directions
for building more robust and context-aware QA systems leveraging multimedia
data.
comment: In Proceedings of the 2nd ACM Workshop in AI-powered Question and
Answering Systems (AIQAM '25), October 27-28, 2025, Dublin, Ireland. ACM, New
York, NY, USA, 8 pages. https://doi.org/10.1145/3746274.3760393
☆ Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values
We propose Reinforcement Learning with Explicit Human Values (RLEV), a method
that aligns Large Language Model (LLM) optimization directly with quantifiable
human value signals. While Reinforcement Learning with Verifiable Rewards
(RLVR) effectively trains models in objective domains using binary correctness
rewards, it overlooks that not all tasks are equally significant. RLEV extends
this framework by incorporating human-defined value signals directly into the
reward function. Using exam-style data with explicit ground-truth value labels,
RLEV consistently outperforms correctness-only baselines across multiple RL
algorithms and model scales. Crucially, RLEV policies not only improve
value-weighted accuracy but also learn a value-sensitive termination policy:
concise for low-value prompts, thorough for high-value ones. We demonstrate
this behavior stems from value-weighted gradient amplification on
end-of-sequence tokens. Ablation studies confirm the gain is causally linked to
value alignment. RLEV remains robust under noisy value signals, such as
difficulty-based labels, demonstrating that optimizing for an explicit utility
function offers a practical path to aligning LLMs with human priorities.
comment: 15 pages, 4 figures
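A minimal sketch of the value-weighted reward idea, assuming the per-question value simply scales the correctness reward; this is an illustration of the concept, not the paper's exact reward function.

    def rlev_reward(is_correct: bool, value: float,
                    length_penalty: float = 0.0, n_tokens: int = 0) -> float:
        """Value-weighted verifiable reward (a minimal sketch): correct answers
        earn the human-assigned value of the question instead of a flat +1;
        an optional term discourages long outputs."""
        return (value if is_correct else 0.0) - length_penalty * n_tokens

    # A 2-point exam question answered correctly, a 10-point one answered wrongly.
    print(rlev_reward(True, value=2.0), rlev_reward(False, value=10.0))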
☆ Mixture-of-Minds: Multi-Agent Reinforcement Learning for Table Understanding
Yuhang Zhou, Mingrui Zhang, Ke Li, Mingyi Wang, Qiao Liu, Qifei Wang, Jiayi Liu, Fei Liu, Serena Li, Weiwi Li, Mingze Gao, Abhishek Kumar, Xiangjun Fan, Zhuokai Zhao, Lizhu Zhang
Understanding and reasoning over tables is a critical capability for many
real-world applications. Large language models (LLMs) have shown promise on
this task, but current approaches remain limited. Fine-tuning based methods
strengthen language reasoning; yet they are prone to arithmetic errors and
hallucination. In contrast, tool-based methods enable precise table
manipulation but rely on rigid schemas and lack semantic understanding. These
complementary drawbacks highlight the need for approaches that integrate robust
reasoning with reliable table processing. In this work, we propose
Mixture-of-Minds, a multi-agent framework that decomposes table reasoning into
three specialized roles: planning, coding, and answering. This design enables
each agent to focus on a specific aspect of the task while leveraging code
execution for precise table manipulation. Building on this workflow, we
introduce a self-improvement training framework that employs Monte Carlo Tree
Search (MCTS) rollouts to generate pseudo-gold trajectories and optimize agents
with reinforcement learning (RL). Extensive experiments show that
Mixture-of-Minds delivers substantial gains, reaching 62.13% on TableBench and
surpassing OpenAI-o4-mini-high. These results demonstrate the promise of
combining structured multi-agent workflows with RL to advance table
understanding.
comment: 18 pages, 4 figures
☆ DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking
Tian Lan, Bin Zhu, Qianghuai Jia, Junyang Ren, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang
Current search agents fundamentally lack the ability to simultaneously
perform \textit{deep} reasoning over multi-hop retrieval and
\textit{wide}-scale information collection-a critical deficiency for real-world
applications like comprehensive market analysis and business development. To
bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly
designed to evaluate agents' ability to integrate depth and width in information
seeking. In DeepWideSearch, agents must process a large volume of data items,
each requiring deep reasoning over multi-hop retrieval paths. Specifically, we
propose two methods to convert established datasets, resulting in a curated
collection of 220 questions spanning 15 diverse domains. Extensive experiments
demonstrate that even state-of-the-art agents achieve only 2.39% average
success rate on DeepWideSearch, highlighting the substantial challenge of
integrating depth and width search in information-seeking tasks. Furthermore,
our error analysis reveals four failure modes: lack of reflection, overreliance
on internal knowledge, insufficient retrieval, and context overflow-exposing
key limitations in current agent architectures. We publicly release
DeepWideSearch to catalyze future research on more capable and robust
information-seeking agents.
☆ Are Stereotypes Leading LLMs' Zero-Shot Stance Detection? EMNLP 2025
Large Language Models inherit stereotypes from their pretraining data,
leading to biased behavior toward certain social groups in many Natural
Language Processing tasks, such as hateful speech detection or sentiment
analysis. Surprisingly, the evaluation of this kind of bias in stance detection
methods has been largely overlooked by the community. Stance Detection involves
labeling a statement as being against, in favor, or neutral towards a specific
target and is among the most sensitive NLP tasks, as it often relates to
political leanings. In this paper, we focus on the bias of Large Language
Models when performing stance detection in a zero-shot setting. We
automatically annotate posts in pre-existing stance detection datasets with two
attributes: dialect or vernacular of a specific group and text
complexity/readability, to investigate whether these attributes influence the
model's stance detection decisions. Our results show that LLMs exhibit
significant stereotypes in stance detection tasks, such as incorrectly
associating pro-marijuana views with low text complexity and African American
dialect with opposition to Donald Trump.
comment: Accepted in EMNLP 2025 (Main)
☆ BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation
Haoyuan Li, Zhengyuan Shen, Sullam Jeoung, Yueyan Chen, Jiayu Li, Qi Zhu, Shuai Wang, Vassilis Ioannidis, Huzefa Rangwala
As structured texts become increasingly complex across diverse domains --
from technical reports to generative AI prompts -- the need for text
segmentation into semantically meaningful components becomes critical. Such
texts often contain elements beyond plain language, including tables, code
snippets, and placeholders, which conventional sentence- or paragraph-level
segmentation methods cannot handle effectively. To address this challenge, we
propose BoundRL, a novel and efficient approach that jointly performs
token-level text segmentation and label prediction for long structured texts.
Instead of generating complete contents for each segment, it generates only a
sequence of starting tokens and reconstructs the complete contents by locating
these tokens within the original texts, thereby reducing inference costs by
orders of magnitude and minimizing hallucination. To adapt the model for the
output format, BoundRL performs reinforcement learning with verifiable rewards
(RLVR) with a specifically designed reward that jointly optimizes document
reconstruction fidelity and semantic alignment. To mitigate entropy collapse,
it further constructs intermediate candidates by systematically perturbing a
fraction of generated sequences of segments to create stepping stones toward
higher-quality solutions. To demonstrate BoundRL's effectiveness on
particularly challenging structured texts, we focus evaluation on complex
prompts used for LLM applications. Experiments show that BoundRL enables small
language models (1.7B parameters) to outperform few-shot prompting of much
larger models. Moreover, RLVR with our designed reward yields significant
improvements over supervised fine-tuning, and incorporating intermediate
candidates further improves both performance and generalization.
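A small Python sketch of the reconstruction step described above: the model emits only the starting tokens of each segment, and full segments are recovered by locating those starts in the original text. The function name and the assumption that starts appear in order and are unambiguous are illustrative.

    def reconstruct_segments(document: str, segment_starts, labels):
        """Rebuild full segments from short 'starting token' strings: locate
        each start in the original document and slice up to the next start
        (or the end of the document)."""
        positions, cursor = [], 0
        for s in segment_starts:
            idx = document.find(s, cursor)
            if idx < 0:
                raise ValueError(f"start not found: {s!r}")
            positions.append(idx)
            cursor = idx + len(s)
        positions.append(len(document))
        return [
            {"label": lab, "text": document[positions[i]:positions[i + 1]]}
            for i, lab in enumerate(labels)
        ]

    doc = ("System: you are a helpful agent. ## Tools\n- search(query)\n"
           "User question: {question}")
    starts = ["System:", "## Tools", "User question:"]
    print(reconstruct_segments(doc, starts, ["preamble", "tools", "task"]))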
☆ AI PB: A Grounded Generative Agent for Personalized Investment Insights
Daewoo Park, Suho Park, Inseok Hong, Hanwool Lee, Junkyu Park, Sangjun Lee, Jeongman An, Hyunbin Loh
We present AI PB, a production-scale generative agent deployed in real retail
finance. Unlike reactive chatbots that answer queries passively, AI PB
proactively generates grounded, compliant, and user-specific investment
insights. It integrates (i) a component-based orchestration layer that
deterministically routes between internal and external LLMs based on data
sensitivity, (ii) a hybrid retrieval pipeline using OpenSearch and the
finance-domain embedding model, and (iii) a multi-stage recommendation
mechanism combining rule heuristics, sequential behavioral modeling, and
contextual bandits. Operating fully on-premises under Korean financial
regulations, the system employs Docker Swarm and vLLM across 24 NVIDIA H100
GPUs. Through human QA and system metrics, we demonstrate that grounded
generation with explicit routing and layered safety can deliver trustworthy AI
insights in high-stakes finance.
comment: Under Review
☆ Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning
Yajie Li, Albert Galimov, Mitra Datta Ganapaneni, Pujitha Thejaswi, De Meng, Priyanshu Kumar, Saloni Potdar
Entity Linking (EL) has traditionally relied on large annotated datasets and
extensive model fine-tuning. While recent few-shot methods leverage large
language models (LLMs) through prompting to reduce training requirements, they
often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER
(Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline
that achieves high performance without deep fine-tuning by strategically
combining candidate generation, context-based scoring, adaptive routing, and
selective reasoning. ARTER computes a small set of complementary signals (both
embedding- and LLM-based) over the retrieved candidates to categorize contextual
mentions into easy and hard cases. These cases are then handled by a
computationally cheap entity linker (e.g., ReFinED) and by more expensive
targeted LLM-based reasoning, respectively. On standard benchmarks, ARTER outperforms
ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets,
and performs comparably to pipelines using LLM-based reasoning for all
mentions, while being twice as efficient in terms of the number of LLM tokens.
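A minimal Python sketch of adaptive routing between a cheap linker and targeted LLM reasoning, using the top-2 candidate score margin as an illustrative "easy vs. hard" signal; ARTER's actual signals and thresholds are not reproduced here.

    def route_mentions(mentions, cheap_linker, llm_linker, margin_threshold=0.15):
        """Route each mention: a small margin between the top-2 candidates is
        treated as a hard case and sent to the expensive LLM reasoner."""
        results = {}
        for mention, candidates in mentions.items():
            # candidates: list of (entity, score) from cheap candidate generation
            ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
            margin = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0.0)
            if margin >= margin_threshold:
                results[mention] = cheap_linker(mention, ranked)   # easy case
            else:
                results[mention] = llm_linker(mention, ranked)     # hard case
        return results

    mentions = {"Paris": [("Paris_France", 0.92), ("Paris_Texas", 0.31)],
                "Jordan": [("Michael_Jordan", 0.55), ("Jordan_country", 0.52)]}
    print(route_mentions(mentions,
                         cheap_linker=lambda m, r: r[0][0],
                         llm_linker=lambda m, r: f"LLM({m})"))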
☆ BIOCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G. Campolongo, Matthew J. Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu
This work investigates descriptive captions as an additional source of
supervision for biological multimodal foundation models. Images and captions
can be viewed as complementary samples from the latent morphospace of a
species, each capturing certain biological traits. Incorporating captions
during training encourages alignment with this shared latent structure,
emphasizing potentially diagnostic characters while suppressing spurious
correlations. The main challenge, however, lies in obtaining faithful,
instance-specific captions at scale. This requirement has limited the
utilization of natural language supervision in organismal biology compared with
many other scientific domains. We address this gap by generating synthetic
captions with multimodal large language models (MLLMs), guided by
Wikipedia-derived visual information and taxon-tailored format examples. These
domain-specific contexts help reduce hallucination and yield accurate,
instance-based descriptive captions. Using these captions, we train BIOCAP
(i.e., BIOCLIP with Captions), a biological foundation model that captures rich
semantics and achieves strong performance in species classification and
text-image retrieval. These results demonstrate the value of descriptive
captions beyond labels in bridging biological images with multimodal foundation
models.
comment: Project page: https://imageomics.github.io/biocap/
☆ CreativityPrism: A Holistic Benchmark for Large Language Model Creativity
Zhaoyi Joey Hou, Bowei Alvin Zhang, Yining Lu, Bhiman Kumar Baghel, Anneliese Brei, Ximing Lu, Meng Jiang, Faeze Brahman, Snigdha Chaturvedi, Haw-Shiuan Chang, Daniel Khashabi, Xiang Lorraine Li
Creativity is often seen as a hallmark of human intelligence. While large
language models (LLMs) are increasingly perceived as producing creative text,
there is still no holistic framework to evaluate their creativity across
diverse scenarios. Existing evaluation methods remain fragmented, with dramatic
variation across domains and tasks, largely due to differing definitions and
measurements of creativity. Inspired by the hypothesis that creativity is not
one fixed idea, we propose CreativityPrism, an evaluation analysis framework
that decomposes creativity into three dimensions: quality, novelty, and
diversity. CreativityPrism incorporates nine tasks across three domains
(divergent thinking, creative writing, and logical reasoning) and twenty
evaluation metrics, which measure each dimension in task-specific ways.
We evaluate 17 state-of-the-art (SoTA) proprietary and open-sourced LLMs on
CreativityPrism and analyze the performance correlations among different
metrics and task domains. Our results reveal a notable gap between proprietary
and open-source models. Overall, model performance tends to be highly
correlated across tasks within the same domain and less so across different
domains. Among evaluation dimensions, diversity and quality metrics show strong
correlations - models that perform well on one often excel on the other -
whereas novelty exhibits much weaker correlation with either. These findings
support our hypothesis that strong performance in one creativity task or
dimension does not necessarily generalize to others, underscoring the need for
a holistic evaluation of LLM creativity.
♻ ☆ Language Models use Lookbacks to Track Beliefs
Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, Atticus Geiger
How do language models (LMs) represent characters' beliefs, especially when
those beliefs may differ from reality? This question lies at the heart of
understanding the Theory of Mind (ToM) capabilities of LMs. We analyze LMs'
ability to reason about characters' beliefs using causal mediation and
abstraction. We construct a dataset, CausalToM, consisting of simple stories
where two characters independently change the state of two objects, potentially
unaware of each other's actions. Our investigation uncovers a pervasive
algorithmic pattern that we call a lookback mechanism, which enables the LM to
recall important information when it becomes necessary. The LM binds each
character-object-state triple together by co-locating their reference
information, represented as Ordering IDs (OIs), in low-rank subspaces of the
state token's residual stream. When asked about a character's beliefs regarding
the state of an object, the binding lookback retrieves the correct state OI and
then the answer lookback retrieves the corresponding state token. When we
introduce text specifying that one character is (not) visible to the other, we
find that the LM first generates a visibility ID encoding the relation between
the observing and the observed character OIs. In a visibility lookback, this ID
is used to retrieve information about the observed character and update the
observing character's beliefs. Our work provides insights into belief tracking
mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.
comment: 31 pages, 33 figures. Code and data at https://belief.baulab.info/
♻ ☆ Text2Mem: A Unified Memory Operation Language for Memory Operating System
Large language model agents increasingly depend on memory to sustain long
horizon interaction, but existing frameworks remain limited. Most expose only a
few basic primitives such as encode, retrieve, and delete, while higher order
operations like merge, promote, demote, split, lock, and expire are missing or
inconsistently supported. Moreover, there is no formal and executable
specification for memory commands, leaving scope and lifecycle rules implicit
and causing unpredictable behavior across systems. We introduce Text2Mem, a
unified memory operation language that provides a standardized pathway from
natural language to reliable execution. Text2Mem defines a compact yet
expressive operation set aligned with encoding, storage, and retrieval. Each
instruction is represented as a JSON based schema instance with required fields
and semantic invariants, which a parser transforms into typed operation objects
with normalized parameters. A validator ensures correctness before execution,
while adapters map typed objects either to a SQL prototype backend or to real
memory frameworks. Model based services such as embeddings or summarization are
integrated when required. All results are returned through a unified execution
contract. This design ensures safety, determinism, and portability across
heterogeneous backends. We also outline Text2Mem Bench, a planned benchmark
that separates schema generation from backend execution to enable systematic
evaluation. Together, these components establish the first standardized
foundation for memory control in agents.
comment: 12 pages, 3 figures, 2 tables
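A minimal Python sketch of a schema-validated operation instance; the field names and operation set below are an illustrative reading of the abstract, not the project's actual schema.

    REQUIRED = {"op", "target", "payload"}
    ALLOWED_OPS = {"encode", "retrieve", "delete", "merge", "promote",
                   "demote", "split", "lock", "expire"}

    def validate(instruction: dict) -> dict:
        """Check required fields and the operation name before handing a typed
        operation object to a backend adapter (minimal validator sketch)."""
        missing = REQUIRED - instruction.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        if instruction["op"] not in ALLOWED_OPS:
            raise ValueError(f"unknown op: {instruction['op']}")
        return {"op": instruction["op"],
                "target": str(instruction["target"]),
                "payload": dict(instruction["payload"])}

    cmd = {"op": "promote", "target": "memory/episode/42",
           "payload": {"reason": "frequently retrieved", "ttl_days": 90}}
    print(validate(cmd))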
♻ ☆ FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts NeurIPS 2025
Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning
method for foundation models, but it suffers from parameter interference,
resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based
LoRA variants show promise in mitigating intra-task correlations in single-task
instruction tuning, they introduce additional router parameters and remain
ineffective in multi-task model merging where inter-task interference arises.
Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit
MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the
up-projection matrix, and (2) an implicit router that unifies expert routing
and down-projection, where a frozen sparse random projection matrix replaces
the traditional dense trainable version. This design resolves the trade-off
between intra-task decorrelation and computational efficiency by eliminating
the need for an explicit router, while inherently mitigating inter-task
interference due to the orthogonality property of random matrices. Extensive
experiments across four domains -- general knowledge understanding, scientific
question answering, mathematical reasoning, and code generation -- demonstrate
consistent performance improvements over existing methods. Beyond empirical
gains, FlyLoRA highlights how biological structures can inspire innovations in
AI technologies. Code is available at https://github.com/gfyddha/FlyLoRA.
comment: NeurIPS 2025 accepted paper
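A rough PyTorch sketch of the core idea as described in the abstract: a frozen sparse random down-projection acting as an implicit router, paired with a trainable up-projection. Shapes, sparsity level, and scaling are illustrative choices, and the rank-wise expert activation is not modeled here.

    import torch
    import torch.nn as nn

    class FlyLoRALinear(nn.Module):
        """LoRA-style adapter where only the up-projection B is trained and the
        down-projection A is a frozen sparse random matrix (sketch)."""
        def __init__(self, base: nn.Linear, rank: int = 8, sparsity: float = 0.9):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False
            d_in, d_out = base.in_features, base.out_features
            A = torch.randn(rank, d_in) * (torch.rand(rank, d_in) > sparsity)
            self.register_buffer("A", A / (rank ** 0.5))    # frozen sparse projection
            self.B = nn.Parameter(torch.zeros(d_out, rank)) # trainable up-projection

        def forward(self, x):
            return self.base(x) + (x @ self.A.t()) @ self.B.t()

    layer = FlyLoRALinear(nn.Linear(64, 64), rank=8)
    print(layer(torch.randn(2, 64)).shape)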
♻ ☆ Integrating Structural and Semantic Signals in Text-Attributed Graphs with BiGTex
Text-attributed graphs (TAGs) present unique challenges in representation
learning by requiring models to capture both the semantic richness of
node-associated texts and the structural dependencies of the graph. While graph
neural networks (GNNs) excel at modeling topological information, they lack the
capacity to process unstructured text. Conversely, large language models (LLMs)
are proficient in text understanding but are typically unaware of graph
structure. In this work, we propose BiGTex (Bidirectional Graph Text), a novel
architecture that tightly integrates GNNs and LLMs through stacked Graph-Text
Fusion Units. Each unit allows for mutual attention between textual and
structural representations, enabling information to flow in both directions:
text influencing structure and structure guiding textual interpretation. The
proposed architecture is trained using parameter-efficient fine-tuning (LoRA),
keeping the LLM frozen while adapting to task-specific signals. Extensive
experiments on five benchmark datasets demonstrate that BiGTex achieves
state-of-the-art performance in node classification and generalizes effectively
to link prediction. An ablation study further highlights the importance of soft
prompting and bi-directional attention in the model's success.
comment: 26 pages, 4 figures
♻ ☆ Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding
Discrete diffusion language models have shown strong potential for text
generation, yet standard supervised fine-tuning (SFT) misaligns with their
semi-autoregressive inference: training randomly masks tokens across the entire
response, while inference generates fixed-size blocks sequentially. This
mismatch introduces noisy prefixes and leaky suffixes, biasing gradients away
from the desired blockwise likelihood. We propose Blockwise SFT, which
partitions responses into fixed-size blocks, selects one active block per step
for stochastic masking, freezes all preceding tokens, and fully hides future
ones. Loss is computed only over the active block, directly mirroring the
blockwise decoding process. Experiments on GSM8K, MATH, and MetaMathQA show
consistent gains over classical SFT under equal compute or token budgets. Block
size consistency studies and ablations confirm that improvements stem from
faithful training-inference alignment rather than incidental masking effects.
Our results highlight the importance of matching supervision granularity to the
decoding procedure in diffusion-based language models.
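A minimal PyTorch sketch of the supervision pattern described above: a visibility mask with a frozen, fully visible prefix, a partially masked active block, and hidden future blocks, plus a loss mask restricted to the active block (here to its masked tokens, one plausible reading). The masking rate and block selection rule are illustrative.

    import torch

    def blockwise_masks(response_len: int, block_size: int, step: int, mask_rate: float = 0.5):
        """Return (visible_mask, loss_mask) over response positions for one
        training step of a blockwise SFT scheme (sketch)."""
        n_blocks = (response_len + block_size - 1) // block_size
        active = step % n_blocks
        lo, hi = active * block_size, min((active + 1) * block_size, response_len)
        visible = torch.zeros(response_len, dtype=torch.bool)
        visible[:lo] = True                                 # frozen, fully visible prefix
        masked_in_block = torch.rand(hi - lo) < mask_rate
        visible[lo:hi] = ~masked_in_block                   # partially masked active block
        loss_mask = torch.zeros(response_len, dtype=torch.bool)
        loss_mask[lo:hi] = masked_in_block                  # loss only inside the active block
        return visible, loss_mask                           # future blocks stay hidden, no loss

    v, l = blockwise_masks(response_len=12, block_size=4, step=1)
    print(v.int().tolist(), l.int().tolist())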
♻ ☆ Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning
When applied to large vision-language model reasoning, reinforcement learning
(typically through GRPO) struggles to effectively scale reasoning length or
generates verbose outputs across all tasks with only marginal gains in
accuracy. To address this issue, we present FAST-GRPO, a variant of GRPO that
dynamically adapts reasoning depth based on question characteristics. Through
empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs
by investigating how response length and data distribution affect performance.
Inspired by these observations, we introduce two complementary metrics to
estimate the difficulty of the questions, guiding the model to determine when
fast or slow thinking is more appropriate. Next, we incorporate adaptive
length-based rewards and difficulty-aware KL divergence into the GRPO
algorithm. Experiments across seven reasoning benchmarks demonstrate that FAST
achieves state-of-the-art accuracy with over 10\% relative improvement compared
to the base model, while reducing token usage by 32.7-67.3\% compared to
previous slow-thinking approaches, effectively balancing reasoning length and
accuracy.
♻ ☆ On the Emergence of Linear Analogies in Word Embeddings NeurIPS 2025
Models such as Word2Vec and GloVe construct word embeddings based on the
co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The
resulting vectors $W_i$ not only group semantically similar words but also
exhibit a striking linear analogy structure -- for example, $W_{\text{king}} -
W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ -- whose
theoretical origin remains unclear. Previous observations indicate that this
analogy structure: (i) already emerges in the top eigenvectors of the matrix
$M(i,j) = P(i,j)/P(i)P(j)$, (ii) strengthens and then saturates as more
eigenvectors of $M (i, j)$, which controls the dimension of the embeddings, are
included, (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$, and
(iv) persists even when all word pairs involved in a specific analogy relation
(e.g., king-queen, man-woman) are removed from the corpus. To explain these
phenomena, we introduce a theoretical generative model in which words are
defined by binary semantic attributes, and co-occurrence probabilities are
derived from attribute-based interactions. This model analytically reproduces
the emergence of linear analogy structure and naturally accounts for properties
(i)-(iv). It can be viewed as giving fine-grained resolution into the role of
each additional embedding dimension. It is robust to various forms of noise and
agrees well with co-occurrence statistics measured on Wikipedia and the analogy
benchmark introduced by Mikolov et al.
comment: Main: 10 pages, 3 figures. Appendices: 11 pages, 7 figures. Accepted
at NeurIPS 2025 as a poster
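A bare-bones NumPy sketch of the construction the abstract analyzes: build M(i,j) = P(i,j)/(P(i)P(j)) from co-occurrence counts, embed words with the top eigenvectors of log M, and test an analogy by nearest neighbor. The random counts and the eigenvalue scaling are illustrative.

    import numpy as np

    def embeddings_from_cooccurrence(C, dim=50, eps=1e-12):
        """Word vectors from a co-occurrence count matrix C via the top
        eigenvectors of the (symmetrized) log of M(i,j) = P(i,j)/(P(i)P(j))."""
        P = C / C.sum()
        p = P.sum(axis=1, keepdims=True)
        M = np.log(P / (p @ p.T) + eps)
        vals, vecs = np.linalg.eigh((M + M.T) / 2)           # symmetric eigendecomposition
        top = np.argsort(vals)[::-1][:dim]
        return vecs[:, top] * np.sqrt(np.abs(vals[top]))     # scale by |eigenvalue|

    def analogy(W, a, b, c):
        """Index closest to W[b] - W[a] + W[c] (e.g. king - man + woman)."""
        target = W[b] - W[a] + W[c]
        sims = W @ target / (np.linalg.norm(W, axis=1) * np.linalg.norm(target) + 1e-12)
        sims[[a, b, c]] = -np.inf
        return int(np.argmax(sims))

    # Random counts only demonstrate the API, not a meaningful analogy.
    C = np.random.default_rng(0).integers(1, 50, size=(200, 200)).astype(float)
    W = embeddings_from_cooccurrence(C, dim=20)
    print(W.shape, analogy(W, 0, 1, 2))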
♻ ☆ Superposition Yields Robust Neural Scaling NeurIPS 2025
The success of today's large language models (LLMs) depends on the
observation that larger models perform better. However, the origin of this
neural scaling law, that loss decreases as a power law with model size, remains
unclear. We propose that representation superposition, meaning that LLMs
represent more features than they have dimensions, can be a key contributor to
loss and cause neural scaling. Based on Anthropic's toy model, we use weight
decay to control the degree of superposition, allowing us to systematically
study how loss scales with model size. When superposition is weak, the loss
follows a power law only if data feature frequencies are power-law distributed.
In contrast, under strong superposition, the loss generically scales inversely
with model dimension across a broad class of frequency distributions, due to
geometric overlaps between representation vectors. We confirmed that
open-sourced LLMs operate in the strong superposition regime, with loss scaling
inversely with model dimension, and that the Chinchilla scaling laws
are also consistent with this behavior. Our results identify representation
superposition as a central driver of neural scaling laws, providing insights
into questions like when neural scaling laws can be improved and when they will
break down.
comment: Accepted at NeurIPS 2025
♻ ☆ ReDit: Reward Dithering for Improved LLM Policy Optimization
DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning
capabilities through its rule-based reward system. While such a "perfect"
reward system effectively mitigates reward hacking, the reward functions
are often discrete. Our experimental observations suggest that discrete rewards
can lead to gradient anomalies, unstable optimization, and slow convergence. To
address this issue, we propose ReDit (Reward Dithering), a method that dithers
the discrete reward signal by adding simple random noise. With this perturbed
reward, exploratory gradients are continuously provided throughout the learning
process, enabling smoother gradient updates and accelerating convergence. The
injected noise also introduces stochasticity into flat reward regions,
encouraging the model to explore novel policies and escape local optima.
Experiments across diverse tasks demonstrate the effectiveness and efficiency
of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO
with only approximately 10% the training steps, and furthermore, still exhibits
a 4% performance improvement over vanilla GRPO when trained for a similar
duration. Visualizations confirm significant mitigation of gradient issues with
ReDit. Moreover, theoretical analyses are provided to further validate these
advantages.
comment: 34 pages, 19 figures
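A minimal Python sketch of reward dithering: zero-mean noise added to a discrete verifiable reward so flat reward regions still carry exploratory signal; the noise type and scale are illustrative.

    import random

    _rng = random.Random(0)

    def redit_reward(base_reward: float, noise: str = "gaussian", scale: float = 0.05) -> float:
        """Perturb a discrete (e.g. 0/1) reward with small zero-mean noise."""
        if noise == "gaussian":
            return base_reward + _rng.gauss(0.0, scale)
        return base_reward + _rng.uniform(-scale, scale)

    print([round(redit_reward(1.0), 3) for _ in range(3)])
    print([round(redit_reward(0.0), 3) for _ in range(3)])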
♻ ☆ X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation
Large Language Models (LLMs) have been shown to enhance the effectiveness of
enriching item descriptions, thereby improving the accuracy of recommendation
systems. However, most existing approaches either rely on text-only prompting
or employ basic multimodal strategies that do not fully exploit the
complementary information available from both textual and visual modalities.
This paper introduces a novel framework, Cross-Reflection Prompting, termed
X-Reflect, designed to address these limitations by prompting Multimodal Large
Language Models (MLLMs) to explicitly identify and reconcile supportive and
conflicting information between text and images. By capturing nuanced insights
from both modalities, this approach generates more comprehensive and
contextually rich item representations. Extensive experiments conducted on two
widely used benchmarks demonstrate that our method outperforms existing
prompting baselines in downstream recommendation accuracy. Furthermore, we
identify a U-shaped relationship between text-image dissimilarity and
recommendation performance, suggesting the benefit of applying multimodal
prompting selectively. To support efficient real-time inference, we also
introduce X-Reflect-keyword, a lightweight variant that summarizes image
content using keywords and replaces the base model with a smaller backbone,
achieving nearly 50% reduction in input length while maintaining competitive
performance. This work underscores the importance of integrating multimodal
information and presents an effective solution for improving item understanding
in multimodal recommendation systems.
♻ ☆ BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning NeurIPS 2025
Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E. White, James Balhoff, Wasila Dahdul, Daniel Rubenstein, Hilmar Lapp, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su
Foundation models trained at scale exhibit remarkable emergent behaviors,
learning new capabilities beyond their initial training objectives. We find
such emergent behaviors in biological vision models via large-scale contrastive
vision-language training. To achieve this, we first curate TreeOfLife-200M,
comprising 214 million images of living organisms, the largest and most diverse
biological organism image dataset to date. We then train BioCLIP 2 on
TreeOfLife-200M to distinguish different species. Despite the narrow training
objective, BioCLIP 2 yields extraordinary accuracy when applied to various
biological visual tasks such as habitat classification and trait prediction. We
identify emergent properties in the learned embedding space of BioCLIP 2. At
the inter-species level, the embedding distribution of different species aligns
closely with functional and ecological meanings (e.g., beak sizes and
habitats). At the intra-species level, instead of being diminished, the
intra-species variations (e.g., life stages and sexes) are preserved and better
separated in subspaces orthogonal to inter-species distinctions. We provide
formal proof and analyses to explain why hierarchical supervision and
contrastive objectives encourage these emergent properties. Crucially, our
results reveal that these properties become increasingly significant with
larger-scale training data, leading to a biologically meaningful embedding
space.
comment: NeurIPS 2025 Spotlight; Project page:
https://imageomics.github.io/bioclip-2/
♻ ☆ Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Reinforcement Learning with Verifiable Rewards (RLVR) has recently
demonstrated notable success in enhancing the reasoning performance of large
language models (LLMs), particularly on mathematics and programming tasks.
Similar to how traditional RL helps agents explore and learn new strategies,
RLVR is believed to enable LLMs to continuously self-improve, thus acquiring
novel reasoning abilities beyond those of the corresponding base models. In
this study we critically examine the current state of RLVR by systematically
probing the reasoning capability boundaries of RLVR-trained LLMs across various
model families, RL algorithms, and math, coding, and visual reasoning
benchmarks, using pass@k at large k values as the evaluation metric.
Surprisingly, we find that the current training setup does not elicit
fundamentally new reasoning patterns. While RLVR-trained models outperform
their base models at small k (e.g., k = 1), the base models achieve a higher
pass@k score when k is large. Coverage and perplexity analyses show that the
observed reasoning abilities originate from and are bounded by the base model.
Treating the base model as an upper bound, our quantitative analysis shows that
six popular RLVR algorithms perform similarly and remain far from optimal in
leveraging the potential of the base model. By contrast, we find that
distillation can introduce new reasoning patterns from the teacher and
genuinely expand the model's reasoning capabilities. Overall, our findings
suggest that current RLVR methods have not yet realized the potential of RL to
elicit truly novel reasoning abilities in LLMs. This highlights the need for
improved RL paradigms, such as continual scaling and multi-turn
agent-environment interaction, to unlock this potential.
comment: 30 pages, 27 figures
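For reference, the standard unbiased per-problem pass@k estimator used in this kind of large-k evaluation; the sample counts below are made up.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """With n sampled solutions of which c are correct, the probability
        that at least one of k drawn samples is correct is
        1 - C(n-c, k) / C(n, k)."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 12 correct out of 256 samples for one problem (made-up numbers).
    print(pass_at_k(256, 12, 1), pass_at_k(256, 12, 128))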
♻ ☆ Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons NeurIPS 2025
Large language models (LLMs) excel in various capabilities but pose safety
risks such as generating harmful content and misinformation, even after safety
alignment. In this paper, we explore the inner mechanisms of safety alignment
through the lens of mechanistic interpretability, focusing on identifying and
analyzing safety neurons within LLMs that are responsible for safety behaviors.
We propose inference-time activation contrasting to locate these neurons and
dynamic activation patching to evaluate their causal effects on model safety.
Experiments on multiple prevalent LLMs demonstrate that we can consistently
identify about $5\%$ safety neurons, and by only patching their activations we
can restore over $90\%$ of the safety performance across various red-teaming
benchmarks without influencing general ability. The finding of safety neurons
also helps explain the "alignment tax" phenomenon by revealing that the key
neurons for model safety and helpfulness significantly overlap, yet they
require different activation patterns for the same neurons. Furthermore, we
demonstrate an application of our findings in safeguarding LLMs by detecting
unsafe outputs before generation. The source code is available at
https://github.com/THU-KEG/SafetyNeuron.
comment: NeurIPS 2025
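A minimal PyTorch sketch of activation contrasting: rank neurons by the gap in mean activation between harmful and harmless prompts and keep roughly the top 5%. The tensors here are random placeholders, and the paper's dynamic activation patching step is not shown.

    import torch

    def safety_neuron_scores(acts_harmful, acts_harmless, frac=0.05):
        """Rank neurons by |mean activation gap| between prompt sets and return
        the indices of the top fraction. Inputs: [num_prompts, num_neurons]."""
        gap = (acts_harmful.mean(0) - acts_harmless.mean(0)).abs()
        k = max(1, int(frac * gap.numel()))
        top = torch.topk(gap, k).indices
        return top, gap

    harmful, harmless = torch.randn(32, 1000), torch.randn(32, 1000)
    idx, gap = safety_neuron_scores(harmful, harmless)
    print(idx.shape)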
♻ ☆ Benchmarking GPT-5 for biomedical natural language processing
Biomedical literature and clinical narratives pose multifaceted challenges
for natural language understanding, from precise entity extraction and document
synthesis to multi-step diagnostic reasoning. This study extends a unified
benchmark to evaluate GPT-5 and GPT-4o under zero-, one-, and five-shot
prompting across five core biomedical NLP tasks: named entity recognition,
relation extraction, multi-label document classification, summarization, and
simplification, and nine expanded biomedical QA datasets covering factual
knowledge, clinical reasoning, and multimodal visual understanding. Using
standardized prompts, fixed decoding parameters, and consistent inference
pipelines, we assessed model performance, latency, and token-normalized cost
under official pricing. GPT-5 consistently outperformed GPT-4o, with the
largest gains on reasoning-intensive datasets such as MedXpertQA and
DiagnosisArena and stable improvements in multimodal QA. In core tasks, GPT-5
achieved better chemical NER and ChemProt scores but remained below
domain-tuned baselines for disease NER and summarization. Despite producing
longer outputs, GPT-5 showed comparable latency and 30 to 50 percent lower
effective cost per correct prediction. Fine-grained analyses revealed
improvements in diagnosis, treatment, and reasoning subtypes, whereas
boundary-sensitive extraction and evidence-dense summarization remain
challenging. Overall, GPT-5 approaches deployment-ready performance for
biomedical QA while offering a favorable balance of accuracy, interpretability,
and economic efficiency. The results support a tiered prompting strategy:
direct prompting for large-scale or cost-sensitive applications, and
chain-of-thought scaffolds for analytically complex or high-stakes scenarios,
highlighting the continued need for hybrid solutions where precision and
factual fidelity are critical.
♻ ☆ XtraGPT: Context-Aware and Controllable Academic Paper Revision
Nuo Chen, Andre Lin HuiKai, Jiaying Wu, Junyi Hou, Zining Zhang, Qian Wang, Xidong Wang, Bingsheng He
Despite the growing adoption of large language models (LLMs) in academic
workflows, their capabilities remain limited to support high-quality scientific
writing. Most existing systems are designed for general-purpose scientific text
generation and fail to meet the sophisticated demands of research communication
beyond surface-level polishing, such as conceptual coherence across sections.
Furthermore, academic writing is inherently iterative and revision-driven, a
process not well supported by direct prompting-based paradigms. To address
these scenarios, we propose a human-AI collaboration framework for academic
paper revision centered on criteria-guided intent alignment and context-aware
modeling. To validate the framework, we curate a dataset of 7,000 research
papers from top-tier venues annotated with 140,000 instruction-response pairs
that reflect realistic, section-level scientific revisions. We instantiate the
framework in XtraGPT, the first suite of open-source LLMs (1.5B to 14B
parameters) for context-aware, instruction-guided writing assistance. Extensive
experiments validate that XtraGPT significantly outperforms same-scale
baselines and approaches the quality of proprietary systems. Both automated
preference assessments and human evaluations confirm the effectiveness of
XtraGPT in improving scientific drafts.
comment: Preprint. The model report is available at
https://arxiv.org/abs/2505.11336v1
♻ ☆ Unlocking Multi-View Insights in Knowledge-Dense Retrieval-Augmented Generation
While Retrieval-Augmented Generation (RAG) plays a crucial role in the
application of Large Language Models (LLMs), existing retrieval methods in
knowledge-dense domains like law and medicine still suffer from a lack of
multi-perspective views, which are essential for improving interpretability and
reliability. Previous research on multi-view retrieval often focused solely on
different semantic forms of queries, neglecting the expression of specific
domain knowledge perspectives. This paper introduces a novel multi-view RAG
framework, MVRAG, tailored for knowledge-dense domains that utilizes
intention-aware query rewriting from multiple domain viewpoints to enhance
retrieval precision, thereby improving the effectiveness of the final
inference. Experiments conducted on legal and medical case retrieval
demonstrate significant improvements in recall and precision rates with our
framework. Our multi-perspective retrieval approach unleashes the potential of
multi-view information enhancing RAG tasks, accelerating the further
application of LLMs in knowledge-intensive fields.
♻ ☆ Neural Attention Search
We present Neural Attention Search (NAtS), a framework that automatically
evaluates the importance of each token within a sequence and determines if the
corresponding token can be dropped after several steps. This approach can
efficiently reduce the KV cache sizes required by transformer-based models
during inference and thus reduce inference costs. In this paper, we design a
search space that contains three token types: (i) Global Tokens will be
preserved and queried by all the following tokens. (ii) Local Tokens survive
until the next global token appears. (iii) Sliding Window Tokens influence the
inference of a fixed-size window of subsequent tokens. Similar to the
One-Shot Neural Architecture Search approach, this token-type information can
be learned jointly with the architecture weights via a learnable attention
mask. Experiments on both training a new transformer from scratch and
fine-tuning existing large language models show that NAtS can efficiently
reduce the KV cache size required for the models while maintaining the models'
performance.
comment: 35 pages, 11 figures
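As a rough illustration of the token-type idea described above, the sketch below builds a causal attention mask from per-token type labels. This is not the authors' implementation; the type assignments, the window size, and the exact survival rule for local tokens are assumptions for demonstration only.
```python
import numpy as np

# Illustrative sketch (not the NAtS implementation): turn per-token type
# labels into a binary attention mask consistent with the three token types
# described in the abstract.
def build_token_type_mask(token_types, window=4):
    n = len(token_types)
    mask = np.zeros((n, n), dtype=bool)      # mask[q, k] = True -> query q may attend to key k
    next_global = [n] * n                    # index of the next global token strictly after k
    nxt = n
    for k in range(n - 1, -1, -1):
        next_global[k] = nxt
        if token_types[k] == "global":
            nxt = k
    for q in range(n):
        for k in range(q + 1):               # causal: keys at or before the query
            t = token_types[k]
            if t == "global":
                visible = True               # preserved and queried by all following tokens
            elif t == "local":
                visible = q <= next_global[k]  # survives until the next global token appears
            else:                            # "window"
                visible = q - k <= window    # affects only a fixed-size window of later tokens
            mask[q, k] = visible
    return mask

types = ["global", "window", "local", "window", "global", "local"]
print(build_token_type_mask(types, window=2).astype(int))
```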
♻ ☆ Position: The Current AI Conference Model is Unsustainable! Diagnosing the Crisis of Centralized AI Conference
Artificial Intelligence (AI) conferences are essential for advancing
research, sharing knowledge, and fostering academic community. However, their
rapid expansion has rendered the centralized conference model increasingly
unsustainable. This paper offers a data-driven diagnosis of a structural crisis
that threatens the foundational goals of scientific dissemination, equity, and
community well-being. We identify four key areas of strain: (1) scientifically,
with per-author publication rates more than doubling over the past decade to
over 4.5 papers annually; (2) environmentally, with the carbon footprint of a
single conference exceeding the daily emissions of its host city; (3)
psychologically, with 71% of online community discourse reflecting negative
sentiment and 35% referencing mental health concerns; and (4) logistically,
with attendance at top conferences such as NeurIPS 2024 beginning to outpace
venue capacity. These pressures point to a system that is misaligned with its
core mission. In response, we propose the Community-Federated Conference (CFC)
model, which separates peer review, presentation, and networking into globally
coordinated but locally organized components, offering a more sustainable,
inclusive, and resilient path forward for AI research.
comment: Preprint
♻ ☆ Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders EMNLP 2025
Large language models (LLMs) are now ubiquitous in user-facing applications,
yet they still generate undesirable toxic outputs, including profanity,
vulgarity, and derogatory remarks. Although numerous detoxification methods
exist, most apply broad, surface-level fixes and can therefore easily be
circumvented by jailbreak attacks. In this paper we leverage sparse
autoencoders (SAEs) to identify toxicity-related directions in the residual
stream of models and perform targeted activation steering using the
corresponding decoder vectors. We introduce three tiers of steering
aggressiveness and evaluate them on GPT-2 Small and Gemma-2-2B, revealing
trade-offs between toxicity reduction and language fluency. At stronger
steering strengths, these causal interventions surpass competitive baselines in
reducing toxicity by up to 20%, though fluency can degrade noticeably on GPT-2
Small depending on the aggressiveness. Crucially, standard NLP benchmark scores
upon steering remain stable, indicating that the model's knowledge and general
abilities are preserved. We further show that feature-splitting in wider SAEs
hampers safety interventions, underscoring the importance of disentangled
feature learning. Our findings highlight both the promise and the current
limitations of SAE-based causal interventions for LLM detoxification, further
suggesting practical guidelines for safer language-model deployment.
comment: EMNLP 2025
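A minimal sketch of the steering step described above, assuming the residual-stream activation and an SAE decoder vector for a toxicity-related feature are already available; the function name, hook point, and steering coefficient are illustrative, not the paper's configuration.
```python
import torch

# Minimal sketch: subtract a scaled, unit-norm toxicity direction (an SAE
# decoder vector) from the residual stream of a transformer layer.
def steer_residual(resid, toxic_decoder_vec, alpha=4.0):
    direction = toxic_decoder_vec / toxic_decoder_vec.norm()
    return resid - alpha * direction          # broadcasts over (batch, seq, d_model)

d_model = 768
resid = torch.randn(1, 12, d_model)           # dummy residual-stream activations
toxic_vec = torch.randn(d_model)              # stand-in for an SAE decoder row
steered = steer_residual(resid, toxic_vec, alpha=4.0)
print(steered.shape)
```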
♻ ☆ MoMoE: Mixture of Moderation Experts Framework for AI-Assisted Online Governance EMNLP 2025
Large language models (LLMs) have shown great potential in flagging harmful
content in online communities. Yet, existing approaches for moderation require
a separate model for every community and are opaque in their decision-making,
limiting real-world adoption. We introduce Mixture of Moderation Experts
(MoMoE), a modular, cross-community framework that adds post-hoc explanations
to scalable content moderation. MoMoE orchestrates four operators -- Allocate,
Predict, Aggregate, Explain -- and is instantiated as seven
community-specialized experts (MoMoE-Community) and five norm-violation experts
(MoMoE-NormVio). On 30 unseen subreddits, the best variants obtain Micro-F1
scores of 0.72 and 0.67, respectively, matching or surpassing strong fine-tuned
baselines while consistently producing concise and reliable explanations.
Although community-specialized experts deliver the highest peak accuracy,
norm-violation experts provide steadier performance across domains. These
findings show that MoMoE yields scalable, transparent moderation without
needing per-community fine-tuning. More broadly, they suggest that lightweight,
explainable expert ensembles can guide future NLP and HCI research on
trustworthy human-AI governance of online communities.
comment: EMNLP 2025 (Oral)
♻ ☆ MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations
Large Language Models (LLMs) have inherent limitations of faithfulness and
factuality, commonly referred to as hallucinations. Several benchmarks have
been developed that provide a test bed for factuality evaluation within the
context of English-centric datasets, while relying on supplementary informative
context like web links or text passages but ignoring the available structured
factual resources. To this end, Knowledge Graphs (KGs) have been identified as
a useful aid for hallucination mitigation, as they provide a structured way to
represent the facts about entities and their relations with minimal linguistic
overhead. We address the lack of KG paths and multilinguality in existing
hallucination evaluation benchmarks by proposing MultiHal, a KG-based
multilingual, multi-hop benchmark designed for
generative text evaluation. As part of our data collection pipeline, we mined
140k KG-paths from open-domain KGs, from which we pruned noisy KG-paths,
curating a high-quality subset of 25.9k. Our baseline evaluation shows an
absolute improvement of approximately 0.12 to 0.36 points for the semantic
similarity score, 0.16 to 0.36 for NLI entailment, and 0.29 to 0.42 for
hallucination detection in KG-RAG over vanilla QA across multiple languages and
multiple models, demonstrating the potential of KG integration. We anticipate
MultiHal will foster future research towards several graph-based hallucination
mitigation and fact-checking tasks.
♻ ☆ Embodied Agents Meet Personalization: Investigating Challenges and Solutions Through the Lens of Memory Utilization
Taeyoon Kwon, Dongwook Choi, Hyojun Kim, Sunghwan Kim, Seungjun Moon, Beong-woo Kwak, Kuan-Hao Huang, Jinyoung Yeo
LLM-powered embodied agents have shown success on conventional
object-rearrangement tasks, but providing personalized assistance that
leverages user-specific knowledge from past interactions presents new
challenges. We investigate these challenges through the lens of agents' memory
utilization along two critical dimensions: object semantics (identifying
objects based on personal meaning) and user patterns (recalling sequences from
behavioral routines). To assess these capabilities, we construct MEMENTO, an
end-to-end two-stage evaluation framework comprising single-memory and
joint-memory tasks. Our experiments reveal that current agents can recall
simple object semantics but struggle to apply sequential user patterns to
planning. Through in-depth analysis, we identify two critical bottlenecks:
information overload and coordination failures when handling multiple memories.
Based on these findings, we explore memory architectural approaches to address
these challenges. Given our observation that episodic memory provides both
personalized knowledge and in-context learning benefits, we design a
hierarchical knowledge graph-based user-profile memory module that separately
manages personalized knowledge, achieving substantial improvements on both
single and joint-memory tasks. Project website:
https://connoriginal.github.io/MEMENTO
comment: Work in progress
♻ ☆ MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues
Recent advances in large language models have catalyzed the development of
multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified
frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to
general-purpose instruction-following models, a key frontier lies in evaluating
their multilingual and multimodal capabilities over both long and short
contexts. However, existing benchmarks fall short in evaluating these
dimensions jointly: they are often limited to English, mostly focus on one
single modality at a time, rely on short-form contexts, or lack human
annotations -- hindering comprehensive assessment of model performance across
languages, modalities, and task complexity. To address these gaps, we introduce
MCIF (Multimodal Crosslingual Instruction Following), the first multilingual
human-annotated benchmark based on scientific talks that is designed to
evaluate instruction-following in crosslingual, multimodal settings over both
short- and long-form inputs. MCIF spans three core modalities -- speech,
vision, and text -- and four diverse languages (English, German, Italian, and
Chinese), enabling a comprehensive evaluation of MLLMs' abilities to interpret
instructions across languages and combine them with multimodal contextual
information. MCIF is released under a CC-BY 4.0 license to encourage open
research and progress in MLLMs development.
comment: Data available at https://huggingface.co/datasets/FBK-MT/MCIF |
Evaluation and baselines available at https://github.com/hlt-mt/mcif
♻ ☆ Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants NeurIPS 2025
Lixiong Qin, Shilong Ou, Miaoxuan Zhang, Jiangning Wei, Yuhang Zhang, Xiaoshuai Song, Yuchen Liu, Mei Wang, Weiran Xu
Faces and humans are crucial elements in social interaction and are widely
included in everyday photos and videos. Therefore, a deep understanding of
faces and humans will enable multi-modal assistants to achieve improved
response quality and broadened application scope. Currently, the multi-modal
assistant community lacks a comprehensive and scientific evaluation of face and
human understanding abilities. In this paper, we first propose a hierarchical
ability taxonomy that includes three levels of abilities. Then, based on this
taxonomy, we collect images and annotations from publicly available datasets in
the face and human community and build a semi-automatic data pipeline to
produce problems for the new benchmark. Finally, the obtained Face-Human-Bench
includes a development set and a test set, each with 1800 problems, supporting
both English and Chinese. We conduct evaluations over 25 mainstream multi-modal
large language models (MLLMs) with our Face-Human-Bench, focusing on the
correlation between abilities, the impact of the relative position of targets
on performance, and the impact of Chain of Thought (CoT) prompting on
performance. We also explore which abilities of MLLMs need to be supplemented
by specialist models. The dataset and evaluation code have been made publicly
available at https://face-human-bench.github.io.
comment: 50 pages, 14 figures, 42 tables. NeurIPS 2025 Datasets and Benchmarks
Track
♻ ☆ Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification
As large language models (LLMs) become increasingly prevalent in global
applications, ensuring that they are toxicity-free across diverse linguistic
contexts remains a critical challenge. We explore "Cross-lingual
Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling
detoxification capabilities to transfer between high and low-resource languages
across different script families. We analyze the effectiveness of cross-lingual
detoxification across 392 settings to evaluate toxicity reduction in
cross-distribution settings with limited data and investigate how mitigation
impacts model performance on non-toxic tasks, revealing trade-offs between
safety and knowledge preservation. Our code and dataset are publicly available
at https://github.com/himanshubeniwal/Breaking-mBad.
comment: Accepted at MELT Workshop @ COLM 2025
♻ ☆ Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models
Large Language Models (LLMs) have shown strong abilities in general language
tasks, yet adapting them to specific domains remains a challenge. Current
methods like Domain Adaptive Pretraining (DAPT) require costly full-parameter
training and suffer from catastrophic forgetting. Meanwhile,
Retrieval-Augmented Generation (RAG) introduces substantial inference latency
due to expensive nearest-neighbor searches and longer context. This paper
introduces Memory Decoder, a plug-and-play pretrained memory that enables
efficient domain adaptation without changing the original model's parameters.
Memory Decoder employs a small transformer decoder that learns to imitate the
behavior of an external non-parametric retriever. Once trained, Memory Decoder
can be seamlessly integrated with any pretrained language model that shares the
same tokenizer, requiring no model-specific modifications. Experimental results
demonstrate that Memory Decoder enables effective adaptation of various Qwen
and Llama models to three distinct specialized domains: biomedicine, finance,
and law, reducing perplexity by an average of 6.17 points. Overall, Memory
Decoder introduces a novel paradigm centered on a specially pretrained memory
component designed for domain-specific adaptation. This memory architecture can
be integrated in a plug-and-play manner, consistently enhancing performance
across multiple models within the target domain.
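One plausible way to combine such a memory component with a base LM at decoding time is kNN-LM-style probability interpolation, sketched below. The abstract does not spell out the exact fusion rule, so the mixing weight and formula here are assumptions.
```python
import torch
import torch.nn.functional as F

# Plausible integration sketch (not necessarily the paper's exact formula):
# interpolate the base LM's next-token distribution with the memory module's
# distribution; both share the same tokenizer and vocabulary.
def interpolate_next_token(base_logits, memory_logits, lam=0.3):
    p_base = F.softmax(base_logits, dim=-1)
    p_mem = F.softmax(memory_logits, dim=-1)
    return lam * p_mem + (1.0 - lam) * p_base

vocab = 32000
base_logits = torch.randn(1, vocab)
memory_logits = torch.randn(1, vocab)
p = interpolate_next_token(base_logits, memory_logits)
print(torch.argmax(p, dim=-1))                # chosen next token id
```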
♻ ☆ Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where
generated responses seem semantically plausible yet exhibit little or no
relevance to the input image. Previous studies reveal that this issue primarily
stems from LVLMs' over-reliance on language priors while disregarding the
visual information during decoding. To alleviate this issue, we introduce a
novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding
strategy, which adaptively strengthens the mutual dependency between generated
texts and input images to mitigate hallucinations. Unlike existing methods
solely focusing on text token sampling, we propose to jointly model the
contributions of visual and textual tokens to C-PMI, formulating hallucination
mitigation as a bi-level optimization problem aimed at maximizing mutual
information. To solve it, we design a token purification mechanism that
dynamically regulates the decoding process by sampling text tokens that remain
maximally relevant to the given image, while simultaneously refining image
tokens most pertinent to the generated response. Extensive experiments across
various benchmarks reveal that the proposed method significantly reduces
hallucinations in LVLMs while preserving decoding efficiency.
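A simplified, PMI-style rescoring in the spirit of the calibration described above: candidate tokens are boosted according to how much the image raises their probability relative to a text-only pass. The two-forward-pass setup and the weighting coefficient are assumptions for illustration, not the paper's full bi-level formulation.
```python
import torch
import torch.nn.functional as F

# Simplified PMI-style calibration: compare token log-probabilities with and
# without the image to favor tokens that are actually grounded in the image.
def pmi_calibrated_scores(logits_with_image, logits_without_image, beta=0.5):
    logp_img = F.log_softmax(logits_with_image, dim=-1)
    logp_txt = F.log_softmax(logits_without_image, dim=-1)
    return logp_img + beta * (logp_img - logp_txt)   # higher = more image-grounded

vocab = 32000
with_img = torch.randn(1, vocab)
without_img = torch.randn(1, vocab)
scores = pmi_calibrated_scores(with_img, without_img)
print(torch.topk(scores, k=5).indices)
```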
♻ ☆ MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Cultural Learning NeurIPS 2025
Embodied agents powered by large language models (LLMs), such as Voyager,
promise open-ended competence in worlds such as Minecraft. However, when
powered by open-weight LLMs they still falter on elementary tasks after
domain-specific fine-tuning. We propose MindForge, a generative-agent framework
for cultural lifelong learning through explicit perspective taking. We
introduce three key innovations: (1) a structured theory of mind representation
linking percepts, beliefs, desires, and actions; (2) natural inter-agent
communication; and (3) a multi-component memory system. Following the cultural
learning framework, we test MindForge in both instructive and collaborative
settings within Minecraft. In an instructive setting with GPT-4, MindForge
agents powered by open-weight LLMs significantly outperform their Voyager
counterparts in basic tasks, yielding $3\times$ more tech-tree milestones and
collecting $2.3\times$ more unique items than the Voyager baseline.
Furthermore, in fully \textit{collaborative} settings, we find that the
performance of two underachieving agents improves with more communication
rounds, echoing the Condorcet Jury Theorem. MindForge agents demonstrate
sophisticated behaviors, including expert-novice knowledge transfer,
collaborative problem solving, and adaptation to out-of-distribution tasks
through accumulated cultural experiences.
comment: Accepted to NeurIPS 2025 main track as poster
♻ ☆ HauntAttack: When Attack Follows Reasoning as a Shadow
Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and
reasoning tasks, showcasing remarkable capabilities. However, the enhancement
of reasoning abilities and the exposure of internal reasoning processes
introduce new safety vulnerabilities. A critical question arises: when
reasoning becomes intertwined with harmfulness, will LRMs become more
vulnerable to jailbreaks in reasoning mode? To investigate this, we introduce
HauntAttack, a novel and general-purpose black-box adversarial attack framework
that systematically embeds harmful instructions into reasoning questions.
Specifically, we modify key reasoning conditions in existing questions with
harmful instructions, thereby constructing a reasoning pathway that guides the
model step by step toward unsafe outputs. We evaluate HauntAttack on 11 LRMs
and observe an average attack success rate of 70\%, achieving up to 12
percentage points of absolute improvement over the strongest prior baseline.
Our further analysis reveals that even advanced safety-aligned models remain
highly susceptible to reasoning-based attacks, offering insights into the
urgent challenge of balancing reasoning capability and safety in future model
development.
♻ ☆ SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment NeurIPS 2025
Large Reasoning Models (LRMs) have become powerful tools for complex problem
solving, but their structured reasoning pathways can lead to unsafe outputs
when exposed to harmful prompts. Existing safety alignment methods reduce
harmful outputs but can degrade reasoning depth, leading to significant
trade-offs in complex, multi-step tasks, and remain vulnerable to sophisticated
jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight
alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at
the start of their reasoning, in response to harmful prompts, while leaving the
rest of the reasoning process unsupervised. Empirical results across multiple
benchmarks indicate that SAFEPATH effectively reduces harmful outputs while
maintaining reasoning performance. Specifically, SAFEPATH reduces harmful
responses by up to 90.0% and blocks 83.3% of jailbreak attempts in the
DeepSeek-R1-Distill-Llama-8B model, while requiring 295.9x less compute than
Direct Refusal and 314.1x less than SafeChain. We further introduce a zero-shot
variant that requires no fine-tuning. In addition, we provide a comprehensive
analysis of how existing methods in LLMs generalize, or fail, when applied to
reasoning-centric models, revealing critical gaps and new directions for safer
AI.
comment: Accepted at NeurIPS 2025. Code and models are available at
https://ai-isl.github.io/safepath
♻ ☆ Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning NeurIPS 2025
Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao
Leveraging attention sparsity to accelerate long-context large language
models (LLMs) has been a hot research topic. However, current algorithms such
as sparse attention or key-value (KV) cache compression tend to use a fixed
budget, which presents a significant challenge during deployment because it
fails to account for the dynamic nature of real-world scenarios, where the
optimal balance between accuracy and efficiency can vary greatly. In this
paper, we find that borrowing top-$p$ sampling (nucleus sampling) for sparse
attention can surprisingly achieve adaptive budgeting. Based on this, we
propose Twilight, a framework to bring adaptive sparsity to any existing sparse
attention algorithm without sacrificing their accuracy. Empirical results show
that Twilight can adaptively prune up to 98% of redundant tokens, leading to
$15.4\times$ acceleration in self-attention operations and $3.9\times$
acceleration in end-to-end per token latency in long context LLM decoding.
comment: To appear on NeurIPS 2025 (spotlight)
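The core idea above, top-$p$ selection applied to attention weights, can be sketched as follows: each query keeps only the smallest set of keys whose attention mass reaches $p$, so the budget adapts to how peaked each distribution is. This is a conceptual sketch under assumed shapes, not the paper's kernel-level implementation.
```python
import torch

# Sketch of top-p (nucleus-style) pruning over attention weights: keep the
# minimal prefix of keys whose cumulative attention mass reaches p.
def top_p_attention_keep_mask(attn_probs, p=0.95):
    """attn_probs: (num_queries, num_keys), rows sum to 1. Returns a bool keep-mask."""
    sorted_probs, sorted_idx = torch.sort(attn_probs, dim=-1, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep_sorted = (cumulative - sorted_probs) < p     # keep until previous mass reaches p
    keep = torch.zeros_like(attn_probs)
    keep.scatter_(dim=-1, index=sorted_idx, src=keep_sorted.to(attn_probs.dtype))
    return keep.bool()

probs = torch.softmax(torch.randn(2, 16), dim=-1)
mask = top_p_attention_keep_mask(probs, p=0.9)
print(mask.sum(dim=-1))   # number of keys kept per query adapts to the distribution
```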
♻ ☆ S-DAT: A Multilingual, GenAI-Driven Framework for Automated Divergent Thinking Assessment
This paper introduces S-DAT (Synthetic-Divergent Association Task), a
scalable, multilingual framework for automated assessment of divergent thinking
(DT), a core component of human creativity. Traditional creativity assessments
are often labor-intensive, language-specific, and reliant on subjective human
ratings, limiting their scalability and cross-cultural applicability. In
contrast, S-DAT leverages large language models and advanced multilingual
embeddings to compute semantic distance -- a language-agnostic proxy for DT. We
evaluate S-DAT across eleven diverse languages, including English, Spanish,
German, Russian, Hindi, and Japanese (Kanji, Hiragana, Katakana), demonstrating
robust and consistent scoring across linguistic contexts. Unlike prior DAT
approaches, the S-DAT shows convergent validity with other DT measures and
correct discriminant validity with convergent thinking. This cross-linguistic
flexibility allows for more inclusive, global-scale creativity research,
addressing key limitations of earlier approaches. S-DAT provides a powerful
tool for fairer, more comprehensive evaluation of cognitive flexibility in
diverse populations and can be freely accessed online:
https://sdat.iol.zib.de/.
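The "semantic distance" proxy mentioned above can be illustrated with a DAT-style score: the mean pairwise cosine distance between embeddings of the produced words. The embedding model is assumed to be supplied upstream; the scaling by 100 follows common DAT practice and may differ from S-DAT's exact scoring.
```python
import numpy as np

# Sketch of a DAT-style divergent-thinking score: average pairwise cosine
# distance between word embeddings (higher = more semantically diverse words).
def dat_score(word_embeddings):
    X = np.asarray(word_embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    sims = X @ X.T
    iu = np.triu_indices(len(X), k=1)                  # each unordered pair once
    return float(np.mean(1.0 - sims[iu]) * 100)

rng = np.random.default_rng(0)
emb = rng.normal(size=(7, 384))       # 7 words encoded by a multilingual embedder (stand-in)
print(round(dat_score(emb), 2))
```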
♻ ☆ ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding NeurIPS 2025
Speculative decoding is a widely adopted technique for accelerating inference
in large language models (LLMs), yet its application to vision-language models
(VLMs) remains underexplored, with existing methods achieving only modest
speedups (<1.5x). This gap is increasingly significant as multimodal
capabilities become central to large-scale models. We hypothesize that large
VLMs can effectively filter redundant image information layer by layer without
compromising textual comprehension, whereas smaller draft models struggle to do
so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a
novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor
module to compress image tokens into a compact representation, which is
seamlessly integrated into the draft model's attention mechanism while
preserving original image positional information. Additionally, we extract a
global feature vector for each input image and augment all subsequent text
tokens with this feature to enhance multimodal coherence. To overcome the
scarcity of multimodal datasets with long assistant responses, we curate a
specialized training dataset by repurposing existing datasets and generating
extended outputs using the target VLM with modified prompts. Our training
strategy mitigates the risk of the draft model exploiting direct access to the
target model's hidden states, which could otherwise lead to shortcut learning
when training solely on target model outputs. Extensive experiments validate
ViSpec, achieving, to our knowledge, the first substantial speedup in VLM
speculative decoding. Code is available at
https://github.com/KangJialiang/ViSpec.
comment: NeurIPS 2025
♻ ☆ Adapting Multilingual Models to Code-Mixed Tasks via Model Merging
Prashant Kodali, Vaishnavi Shivkumar, Swarang Joshi, Monojit Choudhary, Ponnurangam Kumaraguru, Manish Shrivastava
We study model merging as a practical alternative to conventional adaptation
strategies for code-mixed NLP. Starting from a multilingual base model, we: (i)
perform continued pre-training (CPT) on unlabeled code-mixed text to obtain an
adapted checkpoint, (ii) merge this checkpoint with the base model, and (iii)
fine-tune (FT) on the downstream task data. We evaluate our approach on
sentence classification tasks (sentiment and hate speech) in English-Hindi
(En-Hi) and English-Spanish (En-Es) using XLM-R and Llama-3.2-1B models. Our
results show that merged models consistently outperform full fine-tuning and
CPT->FT. We observe gains of 2--5 points in F1 over full fine-tuning and ~1-2
points over CPT->FT, indicating that unlabeled data is leveraged more
effectively via merging than via CPT alone. Zero-/few-shot prompting with
larger LLMs (e.g., Llama-3.3-70B) lags behind fine-tuned and merged
checkpoints, underscoring limits of in-context learning for code-mixed inputs.
We further test cross-pair transfer by training on En-Hi and evaluating on
En-Ta and En-Ml: merged checkpoints transfer more strongly than
monolingual-English baselines (e.g., TV/TIES variants reaching 0.65-0.68 F1 vs
0.61-0.63 for full fine-tuning), suggesting that code-mixed knowledge is a more
reliable substrate for low-resource pairs. We conclude with adaptation recipes
matched to common data regimes (labeled only; labeled+unlabeled; transfer-only)
and discuss limitations and scaling considerations for broader tasks and larger
models.
comment: 9 pages, 5 tables, CODS 2025
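A minimal sketch of the merge step (ii): a simple task-vector interpolation between the base model and the CPT-adapted checkpoint. The paper also evaluates TIES-style variants; the scaling factor and plain averaging here are illustrative assumptions, not the reported recipe.
```python
import torch

# Minimal task-vector merge: add a scaled copy of the CPT-induced weight delta
# back onto the base model's weights, parameter tensor by parameter tensor.
def merge_state_dicts(base_sd, cpt_sd, alpha=0.5):
    merged = {}
    for name, base_w in base_sd.items():
        delta = cpt_sd[name] - base_w            # "task vector" learned during CPT
        merged[name] = base_w + alpha * delta
    return merged

base_sd = {"layer.weight": torch.randn(4, 4)}
cpt_sd = {"layer.weight": base_sd["layer.weight"] + 0.1 * torch.randn(4, 4)}
merged_sd = merge_state_dicts(base_sd, cpt_sd, alpha=0.5)
print(merged_sd["layer.weight"].shape)
```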
♻ ☆ Less is More: Compact Clue Selection for Efficient Retrieval-Augmented Generation Reasoning
Current RAG retrievers are designed primarily for human readers, emphasizing
complete, readable, and coherent paragraphs. However, LLMs benefit more from
precise, compact, and well-structured input, which enhances reasoning quality
and efficiency. Existing methods often rely on reranking or summarization to
identify key sentences, but may suffer from semantic breaks and unfaithfulness.
Thus, efficiently extracting and organizing answer-relevant clues from
large-scale documents while reducing LLM reasoning costs remains a challenge
for RAG. Inspired by Occam's razor, we frame LLM-centric retrieval as a MinMax
optimization: maximizing the extraction of potential clues and reranking them
into a well-organized order, while minimizing reasoning costs by truncating to
the smallest sufficient clue set. In this paper, we propose CompSelect, a Compact
clue Selection mechanism for LLM-centric RAG, consisting of a clue extractor, a
reranker, and a truncator. (1) The clue extractor first uses answer-containing
sentences as fine-tuning targets, aiming to extract sufficient potential clues;
(2) The reranker is trained to prioritize effective clues based on real LLM
feedback; (3) The truncator uses the truncated text containing the minimum
sufficient clues for answering the question as fine-tuning targets, thereby
enabling efficient RAG reasoning. Experiments on three QA datasets show that
CompSelect improves QA performance by approximately 11\% and reduces Total
Latency and Online Latency by approximately 17\% and 67\% compared to various
baseline methods on both LLaMA3 and Qwen3. Further analysis confirms its
robustness to unreliable retrieval and generalization across different
scenarios, offering a scalable and cost-efficient solution for web-scale RAG
applications.
comment: 12 pages, 7 figures, 12 tables, under review
♻ ☆ Bi-Mamba: Towards Accurate 1-Bit State Space Models
The typical Selective State-Space Model (SSM) used in Mamba addresses several
limitations of Transformers, such as the quadratic computational complexity
with respect to sequence length and the significant memory requirements during
inference due to the key-value (KV) cache. However, the increasing size of
Mamba models continues to pose challenges for training and deployment,
particularly due to their substantial computational demands during both
training and inference. In this work, we introduce $\texttt{Bi-Mamba}$, a
scalable and powerful 1-bit Mamba architecture designed to enable more
efficient large language models (LLMs), with model sizes of 780M, 1.3B, and
2.7B parameters. $\texttt{Bi-Mamba}$ models are trained from scratch on a
standard LLM-scale dataset using an autoregressive distillation loss. Extensive
experiments on language modeling benchmarks demonstrate that
$\texttt{Bi-Mamba}$ achieves performance comparable to its full-precision (FP16
or BF16) counterparts, while outperforming post-training binarization (PTB)
Mamba and binarization-aware training (BAT) Transformer baselines. Moreover,
$\texttt{Bi-Mamba}$ drastically reduces memory usage and computational cost
compared to the original Mamba. Our work pioneers a new line of
linear-complexity LLMs under low-bit representation and paves the way for
the design of specialized hardware optimized for efficient 1-bit Mamba-based
models. Code and the pre-trained weights are available at
https://github.com/Tangshengku/Bi-Mamba.
comment: Accepted in TMLR 2025
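For intuition, 1-bit weights are typically obtained by keeping only the sign of each weight plus a per-row scale, as in the standard binarization sketch below; Bi-Mamba's exact quantizer and distillation-based training recipe may differ.
```python
import torch

# Standard sign-based weight binarization with a per-output-row scale:
# each row of the binarized matrix takes values in {-scale, +scale}.
def binarize_weight(w):
    """w: (out_features, in_features) full-precision weight."""
    scale = w.abs().mean(dim=1, keepdim=True)   # per-row scaling factor
    return torch.sign(w) * scale

w = torch.randn(8, 16)
w_bin = binarize_weight(w)
print(torch.unique(w_bin[0]))                   # roughly two values per row
```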
♻ ☆ Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning
Large Language Model (LLM) conditioning refers to instructing an LLM to
generate content in accordance with the norms and values of a specific culture,
beliefs of a particular political orientation, or any desired text-specified
semantic conditioning. Unfortunately, prompt engineering does not ensure that
LLMs behave in accordance with a desired conditioning due to the inductive bias
of the pre-training and alignment datasets. Prior works have focused on
fine-tuning LLMs by directly conditioning the LoRA weights; however, such
methods introduce a large number of parameters. As a remedy, we propose Zhyper,
a parameter-efficient factorized hypernetwork framework that generates
context-aware LoRA adapters from textual descriptions. Experiments on multiple
benchmarks show that Zhyper achieves competitive performance with up to 26x
fewer parameters than the state-of-the-art baselines. Furthermore, we extend
Zhyper to cultural alignment, demonstrating improved generalization to
out-of-domain settings and better capture of fine-grained contextual values.
♻ ☆ MLMA: Towards Multilingual ASR With Mamba-based Architectures ICASSP 2026
Multilingual automatic speech recognition (ASR) remains a challenging task,
especially when balancing performance across high- and low-resource languages.
Recent advances in sequence modeling suggest that architectures beyond
Transformers may offer better scalability and efficiency. In this work, we
introduce MLMA (Multilingual Language Modeling with Mamba for ASR), a new
approach that leverages the Mamba architecture -- an efficient state-space
model optimized for long-context sequence processing -- for multilingual ASR.
Using Mamba, MLMA implicitly incorporates language-aware conditioning and
shared representations to support robust recognition across diverse languages.
Experiments on standard multilingual benchmarks show that MLMA achieves
competitive performance compared to Transformer-based architectures. These
results highlight Mamba's potential as a strong backbone for scalable,
efficient, and accurate multilingual speech recognition.
comment: The paper is under review at ICASSP 2026
♻ ☆ Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution
Enhancing on-device large language models (LLMs) with contextual information
from local data enables personalized and task-aware generation, powering use
cases such as intelligent assistants and UI agents. While recent developments
in neural processors have substantially improved the efficiency of prefill on
mobile devices, the token-by-token generation process still suffers from high
latency and limited hardware utilization due to its inherently memory-bound
characteristics. This work presents sd.npu, a mobile inference framework that
integrates speculative decoding with dynamic hardware scheduling to accelerate
context-aware text generation on mobile devices. The framework introduces three
synergistic components: (1) adaptive execution scheduling, which dynamically
balances compute graphs between prefill and decoding phases; (2)
context-aligned drafting, which improves speculative efficiency through
lightweight online calibration to current tasks; and (3) hardware-efficient
draft extension, which reuses and expands intermediate sequences to improve
processing parallelism and reduce verification cost. Experiments on multiple
smartphones and representative workloads show consistent improvements of up to
3.8x in generation speed and 4.7x in energy efficiency compared with existing
mobile inference solutions. Component-level analysis further validates the
contribution of each optimization.
♻ ☆ Diagnosing Representation Dynamics in NER Model Extension
Extending Named Entity Recognition (NER) models to new PII entities in noisy
spoken-language data is a common need. We find that jointly fine-tuning a BERT
model on standard semantic entities (PER, LOC, ORG) and new pattern-based PII
(EMAIL, PHONE) results in minimal degradation for original classes. We
investigate this "peaceful coexistence," hypothesizing that the model uses
independent semantic vs. morphological feature mechanisms.
Using an incremental learning setup as a diagnostic tool, we measure semantic
drift and find two key insights. First, the LOC (location) entity is uniquely
vulnerable due to a representation overlap with new PII, as it shares
pattern-like features (e.g., postal codes). Second, we identify a "reverse
O-tag representation drift." The model, initially trained to map PII patterns
to 'O', blocks new learning. This is resolved only by unfreezing the 'O' tag's
classifier, allowing the background class to adapt and "release" these
patterns. This work provides a mechanistic diagnosis of NER model adaptation,
highlighting feature independence, representation overlap, and 'O' tag
plasticity. Work done based on data gathered by https://www.papernest.com
♻ ☆ More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
Retrieval-Augmented Generation (RAG) enhances the accuracy of Large Language
Model (LLM) responses by leveraging relevant external documents during
generation. Although previous studies noted that retrieving many documents can
degrade performance, they did not isolate how the quantity of documents affects
performance while controlling for context length. We evaluate various language
models on custom datasets derived from a multi-hop QA task. We keep the context
length and position of relevant information constant while varying the number
of documents, and find that increasing the document count in RAG settings poses
significant challenges for most LLMs, reducing performance by up to 20%.
However, Qwen2.5 maintained consistent results across increasing document
counts, indicating better multi-document handling capability. Finally, our
results indicate that processing multiple documents is a separate challenge
from handling long contexts. We also make the datasets and code available:
https://github.com/shaharl6000/MoreDocsSameLen.
comment: Preprint
♻ ☆ A New Benchmark Dataset and Mixture-of-Experts Language Models for Adversarial Natural Language Inference in Vietnamese
Existing Vietnamese Natural Language Inference (NLI) datasets lack
adversarial complexity, limiting their ability to evaluate model robustness
against challenging linguistic phenomena. In this article, we address the gap
in robust Vietnamese NLI resources by introducing ViANLI, the first adversarial
NLI dataset for Vietnamese, and propose NLIMoE, a Mixture-of-Experts model to
tackle its complexity. We construct ViANLI using an adversarial
human-and-machine-in-the-loop approach with rigorous verification. NLIMoE
integrates expert subnetworks with a learned dynamic routing mechanism on top
of a shared transformer encoder. ViANLI comprises over 10,000
premise-hypothesis pairs and challenges state-of-the-art models, with XLM-R
Large achieving only 45.5% accuracy, while NLIMoE reaches 47.3%. Training with
ViANLI improves performance on other benchmark Vietnamese NLI datasets
including ViNLI, VLSP2021-NLI, and VnNewsNLI. ViANLI is released for enhancing
research into model robustness and enriching resources for future Vietnamese
and multilingual NLI research.
comment: Accepted by Expert Systems with Applications
♻ ☆ Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities NeurIPS 2025
Mayank Jobanputra, Yana Veitsman, Yash Sarrof, Aleksandra Bakalova, Vera Demberg, Ellie Pavlick, Michael Hahn
Transformers have theoretical limitations in modeling certain
sequence-to-sequence tasks, yet it remains largely unclear if these limitations
play a role in large-scale pretrained LLMs, or whether LLMs might effectively
overcome these constraints in practice due to the scale of both the models
themselves and their pretraining data. We explore how these architectural
constraints manifest after pretraining, by studying a family of
$\textit{retrieval}$ and $\textit{copying}$ tasks inspired by Liu et al.
[2024a]. We use a recently proposed framework for studying length
generalization [Huang et al., 2025] to provide guarantees for each of our
settings. Empirically, we observe an $\textit{induction-versus-anti-induction}$
asymmetry, where pretrained models are better at retrieving tokens to the right
(induction) rather than the left (anti-induction) of a query token. This
asymmetry disappears upon targeted fine-tuning if length-generalization is
guaranteed by theory. Mechanistic analysis reveals that this asymmetry is
connected to the differences in the strength of induction versus anti-induction
circuits within pretrained transformers. We validate our findings through
practical experiments on real-world tasks demonstrating reliability risks. Our
results highlight that pretraining selectively enhances certain transformer
capabilities, but does not overcome fundamental length-generalization limits.
comment: NeurIPS 2025
♻ ☆ "You Are Rejected!": An Empirical Study of Large Language Models Taking Hiring Evaluations
With the proliferation of the internet and the rapid advancement of
Artificial Intelligence, leading technology companies face an urgent annual
demand for a considerable number of software and algorithm engineers. To
efficiently and effectively identify high-potential candidates from thousands
of applicants, these firms have established a multi-stage selection process,
which crucially includes a standardized hiring evaluation designed to assess
job-specific competencies. Motivated by the demonstrated prowess of Large
Language Models (LLMs) in coding and reasoning tasks, this paper investigates a
critical question: Can LLMs successfully pass these hiring evaluations? To this
end, we conduct a comprehensive examination of a widely used professional
assessment questionnaire. We employ state-of-the-art LLMs to generate responses
and subsequently evaluate their performance. Contrary to the expectation that
LLMs would make ideal engineers, our analysis reveals a significant inconsistency
between the model-generated answers and the company-referenced solutions. Our
empirical findings lead to a striking conclusion: all evaluated LLMs fail to
pass the hiring evaluation.
comment: Technical Report, 14 pages, 8 figures
♻ ☆ Stress-Testing Model Specs Reveals Character Differences among Language Models
Large language models (LLMs) are increasingly trained from AI constitutions
and model specifications that establish behavioral guidelines and ethical
principles. However, these specifications face critical challenges, including
internal conflicts between principles and insufficient coverage of nuanced
scenarios. We present a systematic methodology for stress-testing model
character specifications, automatically identifying numerous cases of principle
contradictions and interpretive ambiguities in current model specs.
We stress test current model specs by generating scenarios that force
explicit tradeoffs between competing value-based principles. Using a
comprehensive taxonomy, we generate diverse value tradeoff scenarios where
models must choose between pairs of legitimate principles that cannot be
simultaneously satisfied. We evaluate responses from twelve frontier LLMs
across major providers (Anthropic, OpenAI, Google, xAI) and measure behavioral
disagreement through value classification scores. Among these scenarios, we
identify over 70,000 cases exhibiting significant behavioral divergence.
Empirically, we show this high divergence in model behavior strongly predicts
underlying problems in model specifications. Through qualitative analysis, we
provide numerous examples of issues in current model specs, such as direct
contradictions and interpretive ambiguities among principles. Additionally,
our generated dataset also reveals both clear misalignment cases and
false-positive refusals across all of the frontier models we study. Lastly, we
characterize the value prioritization patterns of these models and the
differences among them.
♻ ☆ LFD: Layer Fused Decoding to Exploit External Knowledge in Retrieval-Augmented Generation
Yang Sun, Zhiyong Xie, Dan Luo, Long Zhang, Liming Dong, Yunwei Zhao, Xixun Lin, Yanxiong Lu, Chenliang Li, Lixin Zou
Retrieval-augmented generation (RAG) incorporates external knowledge into
large language models (LLMs), improving their adaptability to downstream tasks
and enabling information updates. Surprisingly, recent empirical evidence
demonstrates that injecting noise into retrieved relevant documents
paradoxically facilitates exploitation of external knowledge and improves
generation quality. Although counterintuitive and challenging to apply in
practice, this phenomenon enables granular control and rigorous analysis of how
LLMs integrate external knowledge. Therefore, in this paper, we intervene on
noise injection and establish a layer-specific functional demarcation within
the LLM: shallow layers specialize in local context modeling, intermediate
layers focus on integrating long-range external factual knowledge, and deeper
layers primarily rely on parametric internal knowledge. Building on this
insight, we propose Layer Fused Decoding (LFD), a simple decoding strategy that
directly combines representations from an intermediate layer with final-layer
decoding outputs to fully exploit the external factual knowledge. To identify
the optimal intermediate layer, we introduce an internal knowledge score (IKS)
criterion that selects the layer with the lowest IKS value in the latter half
of layers. Experimental results across multiple benchmarks demonstrate that LFD
helps RAG systems more effectively surface retrieved context knowledge with
minimal cost.
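The fusion idea can be sketched as projecting an intermediate layer's hidden state through the LM head and mixing its distribution with the final layer's, as below. The mixing weight and the intermediate-layer choice (the paper's IKS criterion) are abstracted away here; this is an assumption-laden sketch, not the paper's exact decoding rule.
```python
import torch
import torch.nn.functional as F

# Sketch: fuse an intermediate-layer distribution with the final-layer
# distribution in log space, both obtained through the same LM head.
def layer_fused_logprobs(hidden_mid, hidden_final, lm_head, lam=0.5):
    logp_mid = F.log_softmax(lm_head(hidden_mid), dim=-1)
    logp_final = F.log_softmax(lm_head(hidden_final), dim=-1)
    return torch.logaddexp(torch.log(torch.tensor(lam)) + logp_mid,
                           torch.log(torch.tensor(1.0 - lam)) + logp_final)

d_model, vocab = 512, 32000
lm_head = torch.nn.Linear(d_model, vocab, bias=False)
h_mid, h_final = torch.randn(1, d_model), torch.randn(1, d_model)
print(layer_fused_logprobs(h_mid, h_final, lm_head).shape)
```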
♻ ☆ Multilingual LLM Prompting Strategies for Medical English-Vietnamese Machine Translation
Medical English-Vietnamese machine translation (En-Vi MT) is essential for
healthcare access and communication in Vietnam, yet Vietnamese remains a
low-resource and under-studied language. We systematically evaluate prompting
strategies for six multilingual LLMs (0.5B-9B parameters) on the MedEV dataset,
comparing zero-shot, few-shot, and dictionary-augmented prompting with Meddict,
an English-Vietnamese medical lexicon. Results show that model scale is the
primary driver of performance: larger LLMs achieve strong zero-shot results,
while few-shot prompting yields only marginal improvements. In contrast,
terminology-aware cues and embedding-based example retrieval consistently
improve domain-specific translation. These findings underscore both the promise
and the current limitations of multilingual LLMs for medical En-Vi MT.
comment: This version has been withdrawn after receiving the conference review
results. We are currently extending and reorganizing the work into a new
study
♻ ☆ ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining EMNLP 2025
Seonwu Kim, Yohan Na, Kihun Kim, Hanhee Cho, Geun Lim, Mintae Kim, Seongik Park, Ki Hyun Kim, Youngsub Han, Byoung-Ki Jeon
The emergence of open-source large language models (LLMs) has expanded
opportunities for enterprise applications; however, many organizations still
lack the infrastructure to deploy and maintain large-scale models. As a result,
small LLMs (sLLMs) have become a practical alternative despite inherent
performance limitations. While Domain Adaptive Continual Pretraining (DACP) has
been explored for domain adaptation, its utility in commercial settings remains
under-examined. In this study, we validate the effectiveness of a DACP-based
recipe across diverse foundation models and service domains, producing
DACP-applied sLLMs (ixi-GEN). Through extensive experiments and real-world
evaluations, we demonstrate that ixi-GEN models achieve substantial gains in
target-domain performance while preserving general capabilities, offering a
cost-efficient and scalable solution for enterprise-level deployment.
comment: Accepted at EMNLP 2025 Industry Track
♻ ☆ Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao, Yixuan Sun, Yue Zhang, Yuchen Fang, Zibin Lin, Zixuan Cheng, Jun Zhou
In this technical report, we present the Ring-linear model series,
specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0.
Ring-mini-linear-2.0 comprises 16B total parameters with 957M activated
parameters, while Ring-flash-linear-2.0 contains 104B total parameters with
6.1B activated parameters. Both
models adopt a hybrid architecture that effectively integrates linear attention
and softmax attention, significantly reducing I/O and computational overhead in
long-context inference scenarios. Compared to a 32 billion parameter dense
model, this series reduces inference cost to 1/10, and compared to the original
Ring series, the cost is also reduced by over 50%. Furthermore, through
systematic exploration of the ratio between different attention mechanisms in
the hybrid architecture, we have identified the currently optimal model
structure. Additionally, by leveraging our self-developed high-performance FP8
operator library, linghe, overall training efficiency has been improved by 50%.
Benefiting from the high alignment between the training and inference engine
operators, the models can undergo long-term, stable, and highly efficient
optimization during the reinforcement learning phase, consistently maintaining
SOTA performance across multiple challenging complex reasoning benchmarks.
comment: 20 pages, 13 figures
♻ ☆ Toward Metaphor-Fluid Conversation Design for Voice User Interfaces
Metaphors play a critical role in shaping user experiences with Voice User
Interfaces (VUIs), yet existing designs often rely on static, human-centric
metaphors that fail to adapt to diverse contexts and user needs. This paper
introduces Metaphor-Fluid Design, a novel approach that dynamically adjusts
metaphorical representations based on conversational use-contexts. We compare
this approach to a Default VUI, which characterizes the present implementation
of commercial VUIs commonly designed around the persona of an assistant,
offering a uniform interaction style across contexts. In Study 1 (N=130),
metaphors were mapped to four key use-contexts (commands, information seeking,
sociality, and error recovery) along the dimensions of formality and hierarchy,
revealing distinct preferences for task-specific metaphorical designs. Study 2
(N=91) evaluates a Metaphor-Fluid VUI against a Default VUI, showing that the
Metaphor-Fluid VUI enhances perceived intention to adopt, enjoyment, and
likability by aligning better with user expectations for different contexts.
However, individual differences in metaphor preferences highlight the need for
personalization. These findings challenge the one-size-fits-all paradigm of VUI
design and demonstrate the potential of Metaphor-Fluid Design to create more
adaptive and engaging human-AI interactions.
♻ ☆ TianHui: A Domain-Specific Large Language Model for Diverse Traditional Chinese Medicine Scenarios
Domain-specific LLMs in TCM face limitations in research settings due to
constrained adaptability, insufficient evaluation datasets, and limited
computational resources. This study presents TianHui, a specialized TCM LLM
built through contextual data integration and domain knowledge fusion. We
constructed a large-scale TCM corpus (0.97GB unsupervised data + 611,312 QA
pairs) and employed a two-stage training strategy with QLoRA, DeepSpeed Stage
2, and Flash Attention 2. Evaluation on 12 benchmarks showed TianHui ranked
top-three in all metrics for six datasets (APQ, TCMCD, HFR, HCCA, DHPE, TLAW)
and achieved top results in the other six (TCMEE, APR, GCPMI, TCMKQA, TCMRC,
ADTG). Optimal configuration was identified as LoRA rank=128, alpha=256,
epoch=4, dropout=0.2, max length=2048. TianHui enables systematic preservation
and scalable application of TCM knowledge. All resources are open-sourced.
comment: 46 pages, 5 figures, 3 tables
♻ ☆ Does Thinking More always Help? Mirage of Test-Time Scaling in Reasoning Models NeurIPS 2025
Soumya Suvra Ghosal, Souradip Chakraborty, Avinash Reddy, Yifu Lu, Mengdi Wang, Dinesh Manocha, Furong Huang, Mohammad Ghavamzadeh, Amrit Singh Bedi
Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1,
DeepSeek R1) have led to a popular belief that extending thinking traces using
prompts like "Wait" or "Let me rethink" can improve performance. This raises a
natural question: Does thinking more at test-time truly lead to better
reasoning? To answer this question, we perform a detailed empirical study
across models and benchmarks, which reveals a consistent pattern of initial
performance improvements from additional thinking followed by a decline, due to
"overthinking". To understand this non-monotonic trend, we consider a simple
probabilistic model, which reveals that additional thinking increases output
variance, creating an illusion of improved reasoning while ultimately
undermining precision. Thus, observed gains from "more thinking" are not true
indicators of improved reasoning, but artifacts stemming from the connection
between model uncertainty and evaluation metric. This suggests that test-time
scaling through extended thinking is not an effective way to utilize the
inference thinking budget. Recognizing these limitations, we introduce an
alternative test-time scaling approach, parallel thinking, inspired by
Best-of-N sampling. Our method generates multiple independent reasoning paths
within the same inference budget and selects the most consistent response via
majority vote, achieving up to 20% higher accuracy compared to extended
thinking. This provides a simple yet effective mechanism for test-time scaling
of reasoning models.
comment: Accepted at NeurIPS 2025
♻ ☆ LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation
Weikang Yuan, Kaisong Song, Zhuoren Jiang, Junjie Cao, Yujie Zhang, Jun Lin, Kun Kuang, Ji Zhang, Xiaozhong Liu
Legal consultation is essential for safeguarding individual rights and
ensuring access to justice, yet remains costly and inaccessible to many
individuals due to the shortage of professionals. While recent advances in
Large Language Models (LLMs) offer a promising path toward scalable, low-cost
legal assistance, current systems fall short in handling the interactive and
knowledge-intensive nature of real-world consultations. To address these
challenges, we introduce LeCoDe, a real-world multi-turn benchmark dataset
comprising 3,696 legal consultation dialogues with 110,008 dialogue turns,
designed to evaluate and improve LLMs' legal consultation capability. With
LeCoDe, we innovatively collect live-streamed consultations from short-video
platforms, providing authentic multi-turn legal consultation dialogues. The
rigorous annotation by legal experts further enhances the dataset with
professional insights and expertise. Furthermore, we propose a comprehensive
evaluation framework that assesses LLMs' consultation capabilities in terms of
(1) clarification capability and (2) professional advice quality. This unified
framework incorporates 12 metrics across two dimensions. Through extensive
experiments on various general and domain-specific LLMs, our results reveal
significant challenges in this task, with even state-of-the-art models like
GPT-4 achieving only 39.8% recall for clarification and 59% overall score for
advice quality, highlighting the complexity of professional consultation
scenarios. Based on these findings, we further explore several strategies to
enhance LLMs' legal consultation abilities. Our benchmark contributes to
advancing research in legal domain dialogue systems, particularly in simulating
more real-world user-expert interactions.
♻ ☆ MLP Memory: A Retriever-Pretrained Memory for Large Language Models
Modern approaches to enhancing Large Language Models' factual accuracy and
knowledge utilization face a fundamental trade-off: non-parametric
retrieval-augmented generation (RAG) provides flexible access to external
knowledge but suffers from high inference latency and shallow integration,
while parametric fine-tuning methods like LoRA risk catastrophic forgetting and
degraded general capabilities. In this work, we propose MLP Memory, a
lightweight parametric module that learns to internalize retrieval patterns
without explicit document access. By pretraining an MLP to imitate a $k$NN
retriever's behavior on the entire pretraining dataset, we create a
differentiable memory component that captures the benefits of retrieval-based
knowledge access in a fully parametric form. Our architecture integrates this
pretrained MLP Memory with Transformer decoders through simple probability
interpolation, yielding 17.5\% and 24.1\% scaling gains on WikiText-103 and Web
datasets, respectively. It further achieves 12.3\% relative improvement on five
question-answering benchmarks and 5.2 points absolute gain across nine general
NLP tasks, while reducing hallucinations by up to 10 points on HaluEval.
Moreover, MLP Memory delivers 2.5$\times$ faster inference than RAG with
superior accuracy. Our findings show that learning retrieval patterns
parametrically bridges the gap between efficient inference and effective
knowledge access, offering a practical alternative to both RAG and fine-tuning
approaches.
♻ ☆ Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models? NeurIPS 2025
Multi-Agent Debate (MAD) has emerged as a promising paradigm for improving
the performance of large language models through collaborative reasoning.
Despite recent advances, the key factors driving MAD's effectiveness remain
unclear. In this work, we disentangle MAD into two key components--Majority
Voting and inter-agent Debate--and assess their respective contributions.
Through extensive experiments across seven NLP benchmarks, we find that
Majority Voting alone accounts for most of the performance gains typically
attributed to MAD. To explain this, we propose a theoretical framework that
models debate as a stochastic process. We prove that it induces a martingale
over agents' belief trajectories, implying that debate alone does not improve
expected correctness. Guided by these insights, we demonstrate that targeted
interventions, by biasing the belief update toward correction, can meaningfully
enhance debate effectiveness. Overall, our findings suggest that while MAD has
potential, simple ensembling methods remain strong and more reliable
alternatives in many practical settings. Code is released in
https://github.com/deeplearning-wisc/debate-or-vote.
comment: NeurIPS 2025 Spotlight
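The Majority Voting component that the study finds responsible for most of MAD's gains reduces to a few lines, sketched below; the agent answers are hypothetical and tie-breaking is arbitrary here.
```python
from collections import Counter

# Sketch of simple majority voting over independent agent answers:
# the most common answer wins (ties broken by Counter's ordering).
def majority_vote(answers):
    answer, _ = Counter(answers).most_common(1)[0]
    return answer

agent_answers = ["B", "A", "B", "B", "C"]   # hypothetical answers from five agents
print(majority_vote(agent_answers))          # -> "B"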
♻ ☆ Text to Band Gap: Pre-trained Language Models as Encoders for Semiconductor Band Gap Prediction
We investigate transformer-based language models, including RoBERTa, T5,
Llama-3, and MatSciBERT, for predicting the band gaps of semiconductor
materials directly from textual descriptions. The inputs encode key material
features, such as chemical composition, crystal system, space group, and other
structural and electronic properties. Unlike shallow machine learning models,
which require extensive feature engineering, or Graph Neural Networks, which
rely on graph representations derived from atomic coordinates, pretrained
language models can process textual inputs directly, eliminating the need for
manual feature preprocessing or structure-based encoding. Material descriptions
were constructed in two formats: structured strings with a consistent template
and natural language narratives generated via the ChatGPT API. Each model was
augmented with a custom regression head and finetuned for the band gap prediction
task. Language models of different architectures and parameter sizes were all
able to predict band gaps from human-readable text with strong accuracy,
achieving MAEs in the range of 0.25-0.33 eV, highlighting the success of this
approach for scientific regression tasks. Finetuned Llama-3, with 1.2 billion
parameters, achieved the highest accuracy (MAE 0.248 eV, R2 0.891). MatSciBERT,
pretrained on materials science literature, reached comparable performance (MAE
0.288 eV, R2 0.871) with significantly fewer parameters (110 million),
emphasizing the importance of domain-specific pretraining. Attention analysis
shows that both models selectively focus on compositional and spin-related
features while de-emphasizing geometric features, reflecting the difficulty of
capturing spatial information from text. These results establish that
pretrained language models can effectively extract complex feature-property
relationships from textual material descriptions.
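A minimal sketch of the general recipe (pretrained text encoder plus a custom regression head) is shown below; the choice of `roberta-base`, the [CLS]-token pooling, and the toy GaAs target are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch: regress a scalar band gap (eV) from a textual material description
# using a pretrained encoder plus a linear head. Hyperparameters are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BandGapRegressor(nn.Module):
    def __init__(self, encoder_name: str = "roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)  # scalar band gap (eV)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]       # [CLS]-style pooled representation
        return self.head(cls).squeeze(-1)       # predicted band gap in eV

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = BandGapRegressor()
batch = tokenizer(
    ["Composition: GaAs; crystal system: cubic; space group: F-43m."],
    return_tensors="pt", padding=True, truncation=True,
)
pred = model(batch["input_ids"], batch["attention_mask"])
loss = nn.functional.mse_loss(pred, torch.tensor([1.42]))  # toy target in eV
```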
♻ ☆ LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration AAAI 2025
Building a universal multilingual automatic speech recognition (ASR) model
that performs equitably across languages has long been a challenge due to its
inherent difficulties. To address this task, we introduce a Language-Agnostic
Multilingual ASR pipeline through orthography Unification and language-specific
Transliteration (LAMA-UT). LAMA-UT operates without any language-specific
modules while matching the performance of state-of-the-art models trained on a
minimal amount of data. Our pipeline consists of two key steps. First, we
utilize a universal transcription generator to unify orthographic features into
Romanized form and capture common phonetic characteristics across diverse
languages. Second, we utilize a universal converter to transform these
universal transcriptions into language-specific ones. In experiments, we
demonstrate the effectiveness of our proposed method leveraging universal
transcriptions for massively multilingual ASR. Our pipeline achieves a relative
error reduction rate of 45% when compared to Whisper and performs comparably to
MMS, despite being trained on only 0.1% of Whisper's training data.
Furthermore, although our pipeline does not rely on any language-specific
modules, it performs on par with zero-shot ASR approaches that utilize
additional language-specific lexicons and language models. We expect this
framework to serve as a cornerstone for flexible multilingual ASR systems that
are generalizable even to unseen languages.
comment: Accepted to AAAI 2025 (Oral Presentation)
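The two-stage structure of the pipeline can be summarized with the skeleton below; both stages are stubbed placeholders standing in for the universal transcription generator and the language-specific converter, and the toy transliteration table is purely illustrative.

```python
# Skeleton of the two-stage pipeline: (1) universal Romanized transcription,
# (2) language-specific transliteration. Both components are stubs, not real models.
def universal_transcribe(audio_features) -> str:
    """Stage 1 (stub): map audio to a Romanized, orthography-unified transcription."""
    return "privet mir"  # placeholder output

def convert_to_language(romanized: str, language: str) -> str:
    """Stage 2 (stub): transliterate Romanized text into the target orthography."""
    lookup = {("privet mir", "rus"): "привет мир"}  # toy transliteration table
    return lookup.get((romanized, language), romanized)

def lama_ut_style_pipeline(audio_features, language: str) -> str:
    return convert_to_language(universal_transcribe(audio_features), language)

print(lama_ut_style_pipeline(audio_features=None, language="rus"))  # -> привет мир
```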
♻ ☆ KAT-Coder Technical Report
Zizheng Zhan, Ken Deng, Xiaojiang Zhang, Jinghui Wang, Huaixi Tang, Zhiyi Lai, Haoyang Huang, Wen Xiang, Kun Wu, Wenhao Zhuang, Minglei Zhang, Shaojie Wang, Shangpeng Yan, Kepeng Lei, Zongxian Feng, Huiming Wang, Zheng Lin, Mengtong Li, Mengfei Xie, Yinghan Cui, Xuxing Chen, Chao Wang, Weihao Li, Wenqiang Zhu, Jiarong Zhang, Jingxuan Xu, Songwei Yu, Yifan Yao, Xinping Lei, C. Zhang, Han Li, Junqi Xiong, Zuchen Gao, Dailin Li, Haimo Li, Jiaheng Liu, Yuqun Zhang, Junyi Peng, Haotian Zhang, Bin Chen
Recent advances in large language models (LLMs) have enabled progress in
agentic coding, where models autonomously reason, plan, and act within
interactive software development workflows. However, bridging the gap between
static text-based training and dynamic real-world agentic execution remains a
core challenge. In this technical report, we present KAT-Coder, a large-scale
agentic code model trained through a multi-stage curriculum encompassing
Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning
(RFT), and Reinforcement-to-Deployment Adaptation. The Mid-Term stage enhances
reasoning, planning, and reflection capabilities through a corpus of real
software engineering data and synthetic agentic interactions. The SFT stage
constructs a million-sample dataset balancing twenty programming languages, ten
development contexts, and ten task archetypes. The RFT stage introduces a novel
multi-ground-truth reward formulation for stable and sample-efficient policy
optimization. Finally, the Reinforcement-to-Deployment phase adapts the model
to production-grade IDE environments using Error-Masked SFT and Tree-Structured
Trajectory Training. In summary, these stages enable KAT-Coder to achieve
robust tool-use reliability, instruction alignment, and long-context reasoning,
forming a deployable foundation for real-world intelligent coding agents. Our
KAT series 32B model, KAT-Dev, has been open-sourced on
https://huggingface.co/Kwaipilot/KAT-Dev.
♻ ☆ Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards
Youliang Yuan, Qiuyang Mang, Jingbang Chen, Hong Wan, Xiaoyuan Liu, Junjielong Xu, Jen-tse Huang, Wenxuan Wang, Wenxiang Jiao, Pinjia He
Large language models for mathematical reasoning are typically trained with
outcome-based rewards, which credit only the final answer. In our experiments,
we observe that this paradigm is highly susceptible to reward hacking, leading
to a substantial overestimation of a model's reasoning ability. This is
evidenced by a high incidence of false positives - solutions that reach the
correct final answer through an unsound reasoning process. Through a systematic
analysis with human verification, we establish a taxonomy of these failure
modes, identifying patterns like Miracle Steps - abrupt jumps to a correct
output without a valid preceding derivation. Probing experiments suggest a
strong association between these Miracle Steps and memorization, where the
model appears to recall the answer directly rather than deriving it. To
mitigate this systemic issue, we introduce the Rubric Reward Model (RRM), a
process-oriented reward function that evaluates the entire reasoning trajectory
against problem-specific rubrics. The generative RRM provides fine-grained,
calibrated rewards (0-1) that explicitly penalize logical flaws and encourage
rigorous deduction. When integrated into a reinforcement learning pipeline,
RRM-based training consistently outperforms outcome-only supervision across
four math benchmarks. Notably, it boosts Verified Pass@1024 on AIME2024 from
26.7% to 62.6% and reduces the incidence of Miracle Steps by 71%. Our work
demonstrates that rewarding the solution process is crucial for building models
that are not only more accurate but also more reliable.
comment: 25 pages, 11 figures, 6 Tables
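To make the contrast with outcome-only rewards concrete, the sketch below computes a calibrated process reward in [0, 1] from per-criterion rubric scores. The rubric items and weights are hypothetical; the paper's RRM is a generative reward model, not a hand-written scoring function.

```python
# Toy process reward: weighted mean of per-criterion rubric scores in [0, 1].
def rubric_reward(rubric_scores: dict[str, float], weights: dict[str, float]) -> float:
    total_w = sum(weights.values())
    return sum(weights[k] * rubric_scores.get(k, 0.0) for k in weights) / total_w

weights = {"setup_correct": 0.2, "derivation_valid": 0.5, "final_answer": 0.3}

# A "Miracle Step" solution: correct answer with no valid derivation -> low reward.
print(rubric_reward({"setup_correct": 1.0, "derivation_valid": 0.0, "final_answer": 1.0}, weights))  # 0.5
# A rigorous solution is rewarded for the full trajectory, not just the answer.
print(rubric_reward({"setup_correct": 1.0, "derivation_valid": 1.0, "final_answer": 1.0}, weights))  # 1.0
```

Under an outcome-only reward both solutions above would score identically, which is exactly the failure mode the rubric-based formulation is meant to penalize.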
♻ ☆ A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System
Jiale Guo, Suizhi Huang, Mei Li, Dong Huang, Xingsheng Chen, Regina Zhang, Zhijiang Guo, Han Yu, Siu-Ming Yiu, Pietro Lio, Kwok-Yan Lam
The integration of Large Language Models (LLMs) into software engineering has
driven a transition from traditional rule-based systems to autonomous agentic
systems capable of solving complex problems. However, systematic progress is
hindered by a lack of comprehensive understanding of how benchmarks and
solutions interconnect. This survey addresses this gap by providing the first
holistic analysis of LLM-powered software engineering, offering insights into
evaluation methodologies and solution paradigms. We review over 150 recent
papers and propose a taxonomy along two key dimensions: (1) Solutions,
categorized into prompt-based, fine-tuning-based, and agent-based paradigms,
and (2) Benchmarks, including tasks such as code generation, translation, and
repair. Our analysis highlights the evolution from simple prompt engineering to
sophisticated agentic systems incorporating capabilities like planning,
reasoning, memory mechanisms, and tool augmentation. To contextualize this
progress, we present a unified pipeline illustrating the workflow from task
specification to deliverables, detailing how different solution paradigms
address various complexity levels. Unlike prior surveys that focus narrowly on
specific aspects, this work connects 50+ benchmarks to their corresponding
solution strategies, enabling researchers to identify optimal approaches for
diverse evaluation criteria. We also identify critical research gaps and
propose future directions, including multi-agent collaboration, self-evolving
systems, and formal verification integration. This survey serves as a
foundational guide for advancing LLM-driven software engineering. We maintain a
GitHub repository that continuously updates the reviewed and related papers at
https://github.com/lisaGuojl/LLM-Agent-SE-Survey.
comment: 22 pages
♻ ☆ MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning? NeurIPS 2025
The ability to recognize patterns from examples and apply them to new ones is
a primal ability for general intelligence, and is widely studied by psychology
and AI researchers. Many benchmarks have been proposed to measure such ability
for Large Language Models (LLMs); however, they focus on few-shot (usually <10)
settings and lack evaluation for aggregating many pieces of information from
long contexts. On the other hand, the ever-growing context length of LLMs has
brought forth the novel paradigm of many-shot In-Context Learning (ICL), which
addresses new tasks with hundreds to thousands of examples without expensive
and inefficient fine-tuning. However, many-shot evaluations often focus on
classification, and popular long-context LLM tasks such as Needle-In-A-Haystack
(NIAH) seldom require complicated intelligence for integrating many pieces of
information. To fix the issues from both worlds, we propose MIR-Bench, the
first many-shot in-context reasoning benchmark for pattern recognition that
asks LLMs to predict outputs from input-output examples of underlying functions
with diverse data formats. Based on MIR-Bench, we study many novel problems in
many-shot in-context reasoning and acquire many insightful findings, including
scaling effects, robustness, inductive vs. transductive reasoning,
Retrieval-Augmented Generation (RAG), coding for inductive reasoning, and
cross-domain generalizability.
comment: 39 pages, 11 figures. Accepted at the NeurIPS 2025 Datasets &
Benchmarks Track; the latest version incorporates camera-ready modifications
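The many-shot setup can be pictured with the sketch below, which builds a prompt from hundreds of input-output pairs of a hidden function and asks for the output of a held-out input; the function, format, and shot count are illustrative assumptions, not MIR-Bench's actual tasks.

```python
# Sketch of many-shot in-context pattern recognition: many I/O demonstrations of a
# hidden function, followed by a query the model must complete.
import random

def underlying_function(x: list[int]) -> int:
    return sum(v for v in x if v % 2 == 0)  # hidden pattern: sum of even entries

def build_many_shot_prompt(n_shots: int = 200, seed: int = 0) -> str:
    rng = random.Random(seed)
    lines = []
    for _ in range(n_shots):
        x = [rng.randint(0, 9) for _ in range(5)]
        lines.append(f"Input: {x} -> Output: {underlying_function(x)}")
    query = [rng.randint(0, 9) for _ in range(5)]
    lines.append(f"Input: {query} -> Output:")
    return "\n".join(lines)

prompt = build_many_shot_prompt()
print(prompt.splitlines()[0])   # first demonstration
print(prompt.splitlines()[-1])  # the query the model must complete
```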
♻ ☆ Sherlock: Self-Correcting Reasoning in Vision-Language Models NeurIPS 2025
Reasoning Vision-Language Models (VLMs) have shown promising performance on
complex multimodal tasks. However, they still face significant challenges: they
are highly sensitive to reasoning errors, require large volumes of annotated
data or accurate verifiers, and struggle to generalize beyond specific domains.
To address these limitations, we explore self-correction as a strategy to
enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning
VLMs' self-correction abilities and identify key gaps. Based on our findings,
we introduce Sherlock, a self-correction and self-improvement training
framework. Sherlock introduces a trajectory-level self-correction objective, a
preference data construction method based on visual perturbation, and a dynamic
$\beta$ for preference tuning. Once the model acquires self-correction
capabilities using only 20k randomly sampled annotated examples, it continues to
self-improve without external supervision. Built on the Llama3.2-Vision-11B
model, Sherlock achieves remarkable results across eight benchmarks, reaching
an average accuracy of 64.1 with direct generation and 65.4 after
self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and
LlamaV-o1 (63.4) while using less than 20% of the annotated data.
comment: Published at NeurIPS 2025, 27 pages
♻ ☆ RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning NeurIPS 2025
Reinforcement learning (RL) has recently emerged as a compelling approach for
enhancing the reasoning capabilities of large language models (LLMs), where an
LLM generator serves as a policy guided by a verifier (reward model). However,
current RL post-training methods for LLMs typically use verifiers that are
fixed (rule-based or frozen pretrained) or trained discriminatively via
supervised fine-tuning (SFT). Such designs are susceptible to reward hacking
and generalize poorly beyond their training distributions. To overcome these
limitations, we propose Tango, a novel framework that uses RL to concurrently
train both an LLM generator and a verifier in an interleaved manner. A central
innovation of Tango is its generative, process-level LLM verifier, which is
trained via RL and co-evolves with the generator. Importantly, the verifier is
trained solely based on outcome-level verification correctness rewards without
requiring explicit process-level annotations. This generative RL-trained
verifier exhibits improved robustness and superior generalization compared to
deterministic or SFT-trained verifiers, fostering effective mutual
reinforcement with the generator. Extensive experiments demonstrate that both
components of Tango achieve state-of-the-art results among 7B/8B-scale models:
the generator attains best-in-class performance across five competition-level
math benchmarks and four challenging out-of-domain reasoning tasks, while the
verifier leads on the ProcessBench dataset. Remarkably, both components exhibit
particularly substantial improvements on the most difficult mathematical
reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
comment: NeurIPS 2025. The first two authors contributed equally
♻ ☆ Quantitative LLM Judges
Aishwarya Sahoo, Jeevana Kruthi Karnuthala, Tushar Parmanand Budhwani, Pranchal Agarwal, Sankaran Vaidyanathan, Alexa Siu, Franck Dernoncourt, Jennifer Healey, Nedim Lipka, Ryan Rossi, Uttaran Bhattacharya, Branislav Kveton
LLM-as-a-judge is a framework where a large language model (LLM) evaluates
the output of another LLM. While LLMs excel at producing qualitative textual
evaluations, they often struggle to predict human preferences and numeric
scores. We propose quantitative LLM judges, which align evaluation scores of
existing LLM judges to humans in a given domain using regression models. The
models are trained to improve the score of the original judge using its
rationale and score. We present four quantitative judges for different types of
absolute and relative feedback, showcasing the generality and versatility
of our framework. Our framework is more computationally efficient than
supervised fine-tuning and can be more statistically efficient when human
feedback is limited, which is expected in practice. We validate these claims
empirically on four datasets using two base judges. Our experiments show that
quantitative judges can improve the predictive power of existing judges through
post-hoc modeling.
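As a toy illustration of such post-hoc alignment, the sketch below fits a small regression model that maps simple features of a judge's score and rationale to human scores; the features, data, and Ridge model are assumptions for illustration, not the judges proposed in the paper.

```python
# Toy post-hoc calibration: regress human scores on features of the judge's
# numeric score and textual rationale. Features and data are illustrative.
import numpy as np
from sklearn.linear_model import Ridge

judge_scores = np.array([4.0, 2.0, 5.0, 3.0, 1.0])
rationales = [
    "Thorough and well supported answer.",
    "Partially correct but misses key steps.",
    "Excellent, complete, and precise.",
    "Adequate coverage with minor issues.",
    "Largely incorrect reasoning.",
]
human_scores = np.array([3.5, 2.5, 4.8, 3.0, 1.2])

def featurize(score: float, rationale: str) -> list[float]:
    text = rationale.lower()
    return [score, len(text.split()), float("incorrect" in text), float("excellent" in text)]

X = np.array([featurize(s, r) for s, r in zip(judge_scores, rationales)])
calibrator = Ridge(alpha=1.0).fit(X, human_scores)
print(calibrator.predict(np.array([featurize(4.0, "Good but slightly incomplete.")])))
```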
♻ ☆ Annotation Guidelines-Based Knowledge Augmentation: Towards Enhancing Large Language Models for Educational Text Classification
Various machine learning approaches have gained significant popularity for
the automated classification of educational text to identify indicators of
learning engagement -- i.e. learning engagement classification (LEC). LEC can
offer comprehensive insights into human learning processes, attracting
significant interest from diverse research communities, including Natural
Language Processing (NLP), Learning Analytics, and Educational Data Mining.
Recently, Large Language Models (LLMs), such as ChatGPT, have demonstrated
remarkable performance in various NLP tasks. However, their comprehensive
evaluation and improvement approaches in LEC tasks have not been thoroughly
investigated. In this study, we propose the Annotation Guidelines-based
Knowledge Augmentation (AGKA) approach to improve LLMs. AGKA employs GPT 4.0 to
retrieve label definition knowledge from annotation guidelines and then applies
random under-sampling to select a few typical examples. Subsequently, we
conduct a systematic evaluation of LEC on a benchmark comprising six LEC
datasets covering behavior classification (question and
urgency level), emotion classification (binary and epistemic emotion), and
cognition classification (opinion and cognitive presence). The study results
demonstrate that AGKA can enhance non-fine-tuned LLMs, particularly GPT 4.0 and
Llama 3 70B. GPT 4.0 with AGKA few-shot outperforms full-shot fine-tuned models
such as BERT and RoBERTa on simple binary classification datasets. However, GPT
4.0 lags in multi-class tasks that require a deep understanding of complex
semantic information. Notably, Llama 3 70B with AGKA is a promising combination
based on an open-source LLM, because its performance is on par with closed-source
GPT 4.0 with AGKA. In addition, LLMs struggle to distinguish between labels
with similar names in multi-class classification.
comment: The manuscript has been accepted for publication in IEEE Transactions
on Learning Technologies. https://doi.org/10.1109/TLT.2025.3570775
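A rough sketch of the two ingredients described (label definitions retrieved from annotation guidelines and random under-sampling of few-shot examples) is given below; the definitions, example pool, and prompt format are hypothetical placeholders.

```python
# Sketch: build a few-shot classification prompt from guideline-derived label
# definitions plus randomly under-sampled, class-balanced examples.
import random

label_definitions = {
    "confusion": "The learner expresses uncertainty or difficulty understanding.",
    "curiosity": "The learner expresses interest in exploring the topic further.",
}

labeled_pool = [
    ("I have no idea what this question means.", "confusion"),
    ("Which part of the formula applies here?", "confusion"),
    ("I'd love to read more about this method!", "curiosity"),
    ("Can we try this on another dataset?", "curiosity"),
]

def build_prompt(query: str, shots_per_label: int = 1, seed: int = 0) -> str:
    rng = random.Random(seed)
    lines = ["Label definitions:"]
    lines += [f"- {label}: {desc}" for label, desc in label_definitions.items()]
    lines.append("Examples:")
    for label in label_definitions:                      # under-sample each class
        pool = [text for text, lab in labeled_pool if lab == label]
        for text in rng.sample(pool, shots_per_label):
            lines.append(f'Text: "{text}" -> Label: {label}')
    lines.append(f'Text: "{query}" -> Label:')
    return "\n".join(lines)

print(build_prompt("This explanation really lost me."))
```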
♻ ☆ Hybrid Latent Reasoning via Reinforcement Learning NeurIPS 2025
Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, Dong Wang
Recent advances in large language models (LLMs) have introduced latent
reasoning as a promising alternative to autoregressive reasoning. By performing
internal computation with hidden states from previous steps, latent reasoning
benefits from more informative features rather than sampling a discrete
chain-of-thought (CoT) path. Yet latent reasoning approaches are often
incompatible with LLMs, as their continuous paradigm conflicts with the
discrete nature of autoregressive generation. Moreover, these methods rely on
CoT traces for training and thus fail to exploit the inherent reasoning
patterns of LLMs. In this work, we explore latent reasoning by leveraging the
intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we
introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid
latent reasoning approach that (1) integrates prior hidden states into sampled
tokens with a learnable gating mechanism, and (2) initializes training with
predominantly token embeddings while progressively incorporating more hidden
features. This design maintains LLMs' generative capabilities and incentivizes
hybrid reasoning using both discrete and continuous representations. In
addition, the hybrid HRPO introduces stochasticity into latent reasoning via
token sampling, thereby enabling RL-based optimization without requiring CoT
trajectories. Extensive evaluations across diverse benchmarks show that HRPO
outperforms prior methods in both knowledge- and reasoning-intensive tasks.
Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing
behaviors like cross-lingual patterns and shorter completion lengths,
highlighting the potential of our RL-based approach and offering insights for
future work in latent reasoning.
comment: NeurIPS 2025
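The gating idea can be sketched as follows: a learnable gate blends the sampled token's embedding with the previous step's hidden state, with an initialization that favors token embeddings early in training. The gate form, shapes, and bias value are illustrative assumptions, not HRPO's exact design.

```python
# Sketch of a learnable gate mixing discrete token embeddings with continuous
# hidden features. The positive bias makes the gate start near "mostly tokens".
import torch
import torch.nn as nn

class HybridGate(nn.Module):
    def __init__(self, dim: int, init_bias: float = 3.0):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        nn.init.constant_(self.gate.bias, init_bias)  # sigmoid(3) ~ 0.95 -> token-dominated at init

    def forward(self, token_emb: torch.Tensor, prev_hidden: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([token_emb, prev_hidden], dim=-1)))
        return g * token_emb + (1 - g) * prev_hidden  # hybrid discrete/continuous input

gate = HybridGate(dim=8)
mixed = gate(torch.randn(2, 8), torch.randn(2, 8))
print(mixed.shape)  # torch.Size([2, 8])
```

Because tokens are still sampled at each step, the mixture keeps the stochasticity that RL-based optimization needs, which is the point the abstract makes about training without CoT trajectories.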
♻ ☆ PersonaMatrix: A Recipe for Persona-Aware Evaluation of Legal Summarization
Legal documents are often long, dense, and difficult to comprehend, not only
for laypeople but also for legal experts. While automated document
summarization has great potential to improve access to legal knowledge,
prevailing task-based evaluators overlook divergent user and stakeholder needs.
Evaluation tools are needed that can capture the technicality a litigator
expects in a case summary while remaining accessible to self-help members of
the public researching their own lawsuits. We introduce PersonaMatrix, a
persona-by-criterion evaluation
framework that scores summaries through the lens of six personas, including
legal and non-legal users. We also introduce a controlled dimension-shifted
pilot dataset of U.S. civil rights case summaries that varies along depth,
accessibility, and procedural detail, as well as a Diversity-Coverage Index (DCI)
to expose the divergent optima of legal summaries under persona-aware and
persona-agnostic judges. This work enables refinement of legal AI summarization
systems for both expert and non-expert users, with the potential to increase
access to legal knowledge. The code base and data are publicly available in
GitHub.
comment: Accepted for publication in JURIX 2025 (Legal Knowledge and
Information Systems, FAIA series, IOS Press). Long Paper
♻ ☆ AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science
An Luo, Xun Xian, Jin Du, Fangqiao Tian, Ganghua Wang, Ming Zhong, Shengchun Zhao, Xuan Bi, Zirui Liu, Jiawei Zhou, Jayanth Srinivasa, Ashish Kundu, Charles Fleming, Mingyi Hong, Jie Ding
Large language models (LLMs) have advanced the automation of data science
workflows. Yet it remains unclear whether they can critically leverage external
domain knowledge as human data scientists do in practice. To answer this
question, we introduce AssistedDS (Assisted Data Science), a benchmark designed
to systematically evaluate how LLMs handle domain knowledge in tabular
prediction tasks. AssistedDS features both synthetic datasets with explicitly
known generative mechanisms and real-world Kaggle competitions, each
accompanied by curated bundles of helpful and adversarial documents. These
documents provide domain-specific insights into data cleaning, feature
engineering, and model selection. We assess state-of-the-art LLMs on their
ability to discern and apply beneficial versus harmful domain knowledge,
evaluating submission validity, information recall, and predictive performance.
Our results demonstrate three key findings: (1) LLMs frequently exhibit an
uncritical adoption of provided information, significantly impairing their
predictive performance when adversarial content is introduced, (2) helpful
guidance is often insufficient to counteract the negative influence of
adversarial information, and (3) in Kaggle datasets, LLMs often make errors in
handling time-series data, applying consistent feature engineering across
different folds, and interpreting categorical variables correctly. These
findings highlight a substantial gap in current models' ability to critically
evaluate and leverage expert knowledge, underscoring an essential research
direction for developing more robust, knowledge-aware automated data science
systems. Our data and code are publicly available here:
https://github.com/jeremyxianx/Assisted-DS
♻ ☆ RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing
Hao Xiang, Tianyi Tang, Yang Su, Bowen Yu, An Yang, Fei Huang, Yichang Zhang, Yaojie Lu, Hongyu Lin, Xianpei Han, Jingren Zhou, Junyang Lin, Le Sun
Recent advancements in Large Language Models (LLMs) have shown outstanding
potential for role-playing applications. Evaluating these capabilities is
becoming crucial yet remains challenging. Existing benchmarks mostly adopt a
character-centric approach, simplify user-character interactions to
isolated Q&A tasks, and fail to reflect real-world applications. To address
this limitation, we introduce RMTBench, a comprehensive user-centric
bilingual role-playing benchmark featuring 80 diverse characters and over 8,000
dialogue rounds. RMTBench includes custom characters with detailed backgrounds
and abstract characters defined by simple traits, enabling evaluation across
various user scenarios. Our benchmark constructs dialogues based on explicit
user motivations rather than character descriptions, ensuring alignment with
practical user applications. Furthermore, we construct an authentic multi-turn
dialogue simulation mechanism. With carefully selected evaluation dimensions
and LLM-based scoring, this mechanism captures the complex intention of
conversations between the user and the character. By shifting focus from
character background to user intention fulfillment, RMTBench bridges the gap
between academic evaluation and practical deployment requirements, offering a
more effective framework for assessing role-playing capabilities in LLMs. All
code and datasets will be released soon. We release the datasets at
https://huggingface.co/datasets/xiangh/RMTBENCH.
♻ ☆ Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities NAACL 2025
Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, Jianfeng Gao
Recent research has shown that Large Language Models (LLMs) are vulnerable to
automated jailbreak attacks, where adversarial suffixes crafted by algorithms
and appended to harmful queries bypass safety alignment and trigger unintended
responses. Current methods for generating these suffixes are computationally
expensive and have low Attack Success Rates (ASR), especially against
well-aligned models like Llama2 and Llama3. To overcome these limitations, we
introduce ADV-LLM, an iterative self-tuning process that crafts adversarial
LLMs with enhanced jailbreak ability. Our framework significantly reduces the
computational cost of generating adversarial suffixes while achieving nearly
100\% ASR on various open-source LLMs. Moreover, it exhibits strong attack
transferability to closed-source models, achieving 99\% ASR on GPT-3.5 and 49\%
ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving
jailbreak ability, ADV-LLM provides valuable insights for future safety
alignment research through its ability to generate large datasets for studying
LLM safety. Our code is available at: https://github.com/SunChungEn/ADV-LLM
comment: Accepted to NAACL 2025 Main (Oral)
♻ ☆ Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning ICLR2025
Key-Value (KV) caching is a common technique to enhance the computational
efficiency of Large Language Models (LLMs), but its memory overhead grows
rapidly with input length. Prior work has shown that not all tokens are equally
important for text generation, proposing layer-level KV cache compression to
selectively retain key information. Recognizing the distinct roles of attention
heads in generation, we propose HeadKV, a head-level KV cache compression
method, and HeadKV-R2, which leverages a novel contextual reasoning ability
estimation for compression. Our approach operates at the level of individual
heads, estimating their importance for contextual QA tasks that require both
retrieval and reasoning capabilities. Extensive experiments across diverse
benchmarks (LongBench, LooGLE), model architectures (e.g., Llama-3-8B-Instruct,
Mistral-7B-Instruct), and long-context ability tests demonstrate that our
head-level KV cache compression significantly outperforms strong baselines,
particularly in low-resource settings (KV size = 64 & 128). Notably, our method
retains just 1.5% of the KV cache while achieving 97% of the performance of the
full KV cache on the contextual question answering benchmark. Codes are
available at https://github.com/FYYFU/HeadKV
comment: Accepted to ICLR2025
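A toy version of head-level cache compression is sketched below: each attention head receives its own budget and keeps only its top-scoring key-value entries. The random importance scores stand in for the retrieval-and-reasoning-based estimates used in the paper.

```python
# Toy head-level KV cache compression: per-head budgets, top-k retention per head.
import torch

def compress_kv(keys, values, importance, budget_per_head):
    """keys/values: [heads, seq, dim]; importance: [heads, seq]; budget: [heads]."""
    kept_k, kept_v = [], []
    for h in range(keys.shape[0]):
        idx = torch.topk(importance[h], k=int(budget_per_head[h])).indices.sort().values
        kept_k.append(keys[h, idx])
        kept_v.append(values[h, idx])
    return kept_k, kept_v  # ragged: each head may retain a different number of tokens

heads, seq, dim = 4, 128, 64
keys, values = torch.randn(heads, seq, dim), torch.randn(heads, seq, dim)
importance = torch.rand(heads, seq)
budgets = torch.tensor([8, 64, 16, 32])   # larger budget for retrieval/reasoning heads
k_small, v_small = compress_kv(keys, values, importance, budgets)
print([k.shape[0] for k in k_small])      # [8, 64, 16, 32]
```

The ragged per-head retention is what distinguishes this from layer-level compression, where every head in a layer shares the same kept token set.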
♻ ☆ Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models EMNLP 2025
Yeonjun In, Wonjoong Kim, Kanghoon Yoon, Sungchul Kim, Mehrab Tanjim, Sangwu Park, Kibum Kim, Chanyoung Park
As the use of large language model (LLM) agents continues to grow, their
safety vulnerabilities have become increasingly evident. Extensive benchmarks
evaluate various aspects of LLM safety by defining safety in terms of general
standards, overlooking user-specific ones. However, safety standards for LLMs
may vary based on user-specific profiles rather than being
universally consistent across all users. This raises a critical research
question: Do LLM agents act safely when considering user-specific safety
standards? Despite its importance for safe LLM use, no benchmark datasets
currently exist to evaluate the user-specific safety of LLMs. To address this
gap, we introduce U-SafeBench, a benchmark designed to assess the user-specific
aspect of LLM safety. Our evaluation of 20 widely used LLMs reveals that current
LLMs fail to act safely when considering user-specific safety standards,
marking a new discovery in this field. To address this vulnerability, we
propose a simple remedy based on chain-of-thought, demonstrating its
effectiveness in improving user-specific safety. Our benchmark and code are
available at https://github.com/yeonjun-in/U-SafeBench.
comment: EMNLP 2025 Findings