Computation and Language
☆ Grounded Misunderstandings in Asymmetric Dialogue: A Perspectivist Annotation Scheme for MapTask
Collaborative dialogue relies on participants incrementally establishing
common ground, yet in asymmetric settings they may believe they agree while
referring to different entities. We introduce a perspectivist annotation scheme
for the HCRC MapTask corpus (Anderson et al., 1991) that separately captures
speaker and addressee grounded interpretations for each reference expression,
enabling us to trace how understanding emerges, diverges, and repairs over
time. Using a scheme-constrained LLM annotation pipeline, we obtain 13k
annotated reference expressions with reliability estimates and analyze the
resulting understanding states. The results show that full misunderstandings
are rare once lexical variants are unified, but multiplicity discrepancies
systematically induce divergences, revealing how apparent grounding can mask
referential misalignment. Our framework provides both a resource and an
analytic lens for studying grounded misunderstanding and for evaluating
(V)LLMs' capacity to model perspective-dependent grounding in collaborative
dialogue.
comment: 11 pages, 3 figures, 5 tables; under review
☆ Do Androids Dream of Unseen Puppeteers? Probing for a Conspiracy Mindset in Large Language Models
In this paper, we investigate whether Large Language Models (LLMs) exhibit
conspiratorial tendencies, whether they display sociodemographic biases in this
domain, and how easily they can be conditioned into adopting conspiratorial
perspectives. Conspiracy beliefs play a central role in the spread of
misinformation and in shaping distrust toward institutions, making them a
critical testbed for evaluating the social fidelity of LLMs. LLMs are
increasingly used as proxies for studying human behavior, yet little is known
about whether they reproduce higher-order psychological constructs such as a
conspiratorial mindset. To bridge this research gap, we administer validated
psychometric surveys measuring conspiracy mindset to multiple models under
different prompting and conditioning strategies. Our findings reveal that LLMs
show partial agreement with elements of conspiracy belief, and conditioning
with socio-demographic attributes produces uneven effects, exposing latent
demographic biases. Moreover, targeted prompts can easily shift model responses
toward conspiratorial directions, underscoring both the susceptibility of LLMs
to manipulation and the potential risks of their deployment in sensitive
contexts. These results highlight the importance of critically evaluating the
psychological dimensions embedded in LLMs, both to advance computational social
science and to inform possible mitigation strategies against harmful uses.
☆ ChiMDQA: Towards Comprehensive Chinese Document QA with Fine-grained Evaluation ICANN 2025
With the rapid advancement of natural language processing (NLP) technologies,
the demand for high-quality Chinese document question-answering datasets is
steadily growing. To address this need, we present the Chinese Multi-Document
Question Answering Dataset (ChiMDQA), specifically designed for downstream
business scenarios across prevalent domains including academic, education,
finance, law, medical treatment, and news. ChiMDQA encompasses long-form
documents from six distinct fields, consisting of 6,068 rigorously curated,
high-quality question-answer (QA) pairs further classified into ten
fine-grained categories. Through meticulous document screening and a systematic
question-design methodology, the dataset guarantees both diversity and high
quality, rendering it applicable to various NLP tasks such as document
comprehension, knowledge extraction, and intelligent QA systems. Additionally,
this paper offers a comprehensive overview of the dataset's design objectives,
construction methodologies, and fine-grained evaluation system, supplying a
substantial foundation for future research and practical applications in
Chinese QA. The code and data are available at:
https://anonymous.4open.science/r/Foxit-CHiMDQA/.
comment: 13 pages, 6 tables, 4 figures, accepted by ICANN 2025
☆ Watermarking Large Language Models in Europe: Interpreting the AI Act in Light of Technology
To foster trustworthy Artificial Intelligence (AI) within the European Union,
the AI Act requires providers to mark and detect the outputs of their
general-purpose models. Article 50 and Recital 133 call for marking methods
that are "sufficiently reliable, interoperable, effective and robust". Yet,
the rapidly evolving and heterogeneous landscape of watermarks for Large
Language Models (LLMs) makes it difficult to determine how these four standards
can be translated into concrete and measurable evaluations. Our paper addresses
this challenge, anchoring the normativity of European requirements in the
multiplicity of watermarking techniques. Introducing clear and distinct
concepts on LLM watermarking, our contribution is threefold. (1) Watermarking
Categorisation: We propose an accessible taxonomy of watermarking methods
according to the stage of the LLM lifecycle at which they are applied - before,
during, or after training, and during next-token distribution or sampling. (2)
Watermarking Evaluation: We interpret the EU AI Act's requirements by mapping
each criterion with state-of-the-art evaluations on robustness and
detectability of the watermark, and of quality of the LLM. Since
interoperability remains largely untheorised in LLM watermarking research, we
propose three normative dimensions to frame its assessment. (3) Watermarking
Comparison: We compare current watermarking methods for LLMs against the
operationalised European criteria and show that no approach yet satisfies all
four standards. Encouraged by emerging empirical tests, we recommend further
research into watermarking directly embedded within the low-level architecture
of LLMs.
comment: 17 pages, 2 tables and 2 figures
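As a rough illustration of the "during next-token distribution or sampling"
category in the taxonomy above, the sketch below applies generic green-list
logit biasing before sampling; it is not the paper's own method, and all names
and parameter values are illustrative.

    import torch

    def watermark_logits(logits, prev_token_id, vocab_size, gamma=0.5, delta=2.0):
        # Seed a pseudo-random "green list" from the previous token (a simple
        # stand-in for a context hash) and bias those logits before sampling.
        g = torch.Generator()
        g.manual_seed(int(prev_token_id))
        green = torch.randperm(vocab_size, generator=g)[: int(gamma * vocab_size)]
        biased = logits.clone()
        biased[green] += delta  # favor green-list tokens at sampling time
        return biased

    # Detection later checks whether generated tokens fall in their green lists
    # significantly more often than the gamma baseline would predict.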
☆ Towards Transparent Stance Detection: A Zero-Shot Approach Using Implicit and Explicit Interpretability AAAI
Zero-Shot Stance Detection (ZSSD) identifies the attitude of a post toward
unseen targets. Existing research using contrastive learning, meta-learning, or
data augmentation suffers from generalizability issues or a lack of coherence
between
text and target. Recent works leveraging large language models (LLMs) for ZSSD
focus either on improving unseen target-specific knowledge or generating
explanations for stance analysis. However, most of these works are limited by
their over-reliance on explicit reasoning, provide coarse explanations that
lack nuance, and do not explicitly model the reasoning process, making it
difficult to interpret the model's predictions. To address these issues, in our
study, we develop a novel interpretable ZSSD framework, IRIS. We provide an
interpretable understanding of the attitude of the input towards the target
implicitly based on sequences within the text (implicit rationales) and
explicitly based on linguistic measures (explicit rationales). IRIS considers
stance detection as an information retrieval ranking task, understanding the
relevance of implicit rationales for different stances to guide the model
towards correct predictions without requiring the ground-truth of rationales,
thus providing inherent interpretability. In addition, explicit rationales
based on communicative features help decode the emotional and cognitive
dimensions of stance, offering an interpretable understanding of the author's
attitude towards the given target. Extensive experiments on the benchmark
datasets of VAST, EZ-STANCE, P-Stance, and RFD using 50%, 30%, and even 10%
training data prove the generalizability of our model, benefiting from the
proposed architecture and interpretable design.
comment: Accepted at the AAAI Conference on Web and Social Media (ICWSM 2026)
☆ LiveTradeBench: Seeking Real-World Alpha with Large Language Models
Large language models (LLMs) achieve strong performance across
benchmarks--from knowledge quizzes and math reasoning to web-agent tasks--but
these tests occur in static settings, lacking real dynamics and uncertainty.
Consequently, they evaluate isolated reasoning or problem-solving rather than
decision-making under uncertainty. To address this, we introduce
LiveTradeBench, a live trading environment for evaluating LLM agents in
realistic and evolving markets. LiveTradeBench follows three design principles:
(i) Live data streaming of market prices and news, eliminating dependence on
offline backtesting and preventing information leakage while capturing
real-time uncertainty; (ii) a portfolio-management abstraction that extends
control from single-asset actions to multi-asset allocation, integrating risk
management and cross-asset reasoning; and (iii) multi-market evaluation across
structurally distinct environments--U.S. stocks and Polymarket prediction
markets--differing in volatility, liquidity, and information flow. At each
step, an agent observes prices, news, and its portfolio, then outputs
percentage allocations that balance risk and return. Using LiveTradeBench, we
run 50-day live evaluations of 21 LLMs across families. Results show that (1)
high LMArena scores do not imply superior trading outcomes; (2) models display
distinct portfolio styles reflecting risk appetite and reasoning dynamics; and
(3) some LLMs effectively leverage live signals to adapt decisions. These
findings expose a gap between static evaluation and real-world competence,
motivating benchmarks that test sequential decision making and consistency
under live uncertainty.
comment: 16 pages
☆ A systematic review of relation extraction task since the emergence of Transformers
This article presents a systematic review of relation extraction (RE)
research since the advent of Transformer-based models. Using an automated
framework to collect and annotate publications, we analyze 34 surveys, 64
datasets, and 104 models published between 2019 and 2024. The review highlights
methodological advances, benchmark resources, and the integration of semantic
web technologies. By consolidating results across multiple dimensions, the
study identifies current trends, limitations, and open challenges, offering
researchers and practitioners a comprehensive reference for understanding the
evolution and future directions of RE.
comment: Submitted to ACM Computing Surveys + The resulting annotated Zotero
bibliography :
https://www.zotero.org/groups/6070963/scilex_re_systlitreview/library +
SciLEx software: https://github.com/Wimmics/SciLEx
☆ Step-Audio-EditX Technical Report
Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Yuxin Zhang, Xiangyu Zhang, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
We present Step-Audio-EditX, the first open-source LLM-based audio model
excelling at expressive and iterative audio editing encompassing emotion,
speaking style, and paralinguistics alongside robust zero-shot text-to-speech
(TTS) capabilities. Our core innovation lies in leveraging only large-margin
synthetic data, which circumvents the need for embedding-based priors or
auxiliary modules. This large-margin learning approach enables both iterative
control and high expressivity across voices, and represents a fundamental pivot
from the conventional focus on representation-level disentanglement. Evaluation
results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and
Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.
☆ ASVRI-Legal: Fine-Tuning LLMs with Retrieval Augmented Generation for Enhanced Legal Regulation
In this study, we explore the fine-tuning of Large Language Models (LLMs) to
better support policymakers in their crucial work of understanding, analyzing,
and crafting legal regulations. To equip the model with a deep understanding of
legal texts, we curated a supervised dataset tailored to the specific needs of
the legal domain. Additionally, we integrated the Retrieval-Augmented
Generation (RAG) method, enabling the LLM to access and incorporate up-to-date
legal knowledge from external sources. This combination of fine-tuning and
RAG-based augmentation results in a tool that not only processes legal
information but actively assists policymakers in interpreting regulations and
drafting new ones that align with current needs. The results demonstrate that
this approach can significantly enhance the effectiveness of legal research and
regulation development, offering a valuable resource in the ever-evolving field
of law.
comment: 11 pages (including references), 2 figures, 4 tables, published in
Atlantis Press (Open Access under CC BY-NC 4.0 license)
☆ AILA--First Experiments with Localist Language Models
This paper presents the first empirical demonstration of controllable
locality in transformer language models, a novel architectural framework that
enables continuous control over the degree of representation localization
through a tunable locality dial parameter. Unlike traditional language models
that rely exclusively on distributed representations, our approach allows
dynamic interpolation between highly interpretable localist encodings and
efficient distributed representations without requiring model retraining. We
conducted experiments on the WikiText corpus using a two-layer transformer
architecture, systematically varying the locality parameter $\lambda$ across
the full spectrum from 1.0 (fully localist) to 0.0 (fully distributed). Our
results demonstrate that localist configurations achieve dramatically lower
attention entropy, with $\lambda$ = 1.0 yielding 5.36 bits compared to 7.18
bits at $\lambda$ = 0.0, while maintaining substantially higher pointer
fidelity scores reflecting stronger alignment with rule-specified targets.
Prediction experiments reveal that intermediate locality values optimize the
tradeoff between interpretability and performance, with $\lambda$ = 0.6
achieving test perplexity of 4.65 and accuracy of 84.7%. These findings
establish that localist language models provide a practical framework for
applications in regulated domains requiring both transparency and capability,
offering precise mathematical control over the interpretability-performance
spectrum through explicit penalty thresholds and information-theoretic design
principles.
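For readers unfamiliar with the attention-entropy figures quoted above (5.36
vs. 7.18 bits), the following minimal sketch shows one standard way to compute
Shannon entropy in bits over attention distributions; it is an assumption about
the measurement, not the paper's exact protocol.

    import torch

    def attention_entropy_bits(attn, eps=1e-12):
        # attn: (..., seq_len) attention rows that each sum to 1.
        p = attn.clamp_min(eps)
        entropy = -(p * p.log2()).sum(dim=-1)  # Shannon entropy in bits per row
        return entropy.mean()                  # average over heads/positions/batch

    # Lower values indicate peaked, "localist" attention; higher values indicate
    # distributed attention.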
☆ MultiZebraLogic: A Multilingual Logical Reasoning Benchmark LREC 2026
Measuring the full abilities of large language models (LLMs) requires
benchmarks representing multiple tasks. We aim to create large, high-quality
datasets for comparison of logical reasoning skills across several languages
and of suitable difficulty for LLMs of various reasoning ability. We explore
multiple ways of increasing difficulty. We generate zebra puzzles in multiple
languages, themes, and sizes, including 14 different clue types and 8 red
herring types (uninformative clues). We find puzzle sizes 2x3 and 4x5 are
sufficiently challenging for GPT-4o mini (a non-reasoning model) and o3-mini (a
reasoning model), respectively. Including 5 red herrings decreases o3-mini
puzzle-level accuracy on 4x5 puzzles by 15$\pm$7%. Scores of o3-mini on 4x5
puzzles are not significantly affected by use of English vs. Danish or the
common houses theme vs. the country-specific smoerrebroed theme. We find no
correlation between difficulty and the selected clue types. Datasets of
128+1024 puzzles are published as MultiZebraLogic in each of nine Germanic
languages for sizes 2x3 and 4x5. We publish code for puzzle generation,
designed for adaptability to more languages and themes.
comment: Submitted to LREC 2026
☆ Bearing Syntactic Fruit with Stack-Augmented Neural Networks
Any finite set of training data is consistent with an infinite number of
hypothetical algorithms that could have generated it. Studies have shown that
when human children learn language, they consistently favor hypotheses based on
hierarchical syntactic rules without ever encountering disambiguating examples.
A recent line of work has inquired as to whether common neural network
architectures share this bias, finding that they do so only under special
conditions: when syntactically supervised, when pre-trained on massive corpora,
or when trained long past convergence. In this paper, we demonstrate, for the
first time, neural network architectures that are able to generalize in
human-like fashion without any of the aforementioned requirements:
stack-augmented neural networks. We test three base architectures (transformer,
simple RNN, LSTM) augmented with two styles of stack: the superposition stack
of Joulin & Mikolov (2015) and a nondeterministic generalization of it proposed
by DuSell & Chiang (2023). We find that transformers with nondeterministic
stacks generalize best out of these architectures on a classical question
formation task. We also propose a modification to the stack RNN architecture
that improves hierarchical generalization. These results suggest that
stack-augmented neural networks may be more accurate models of human language
acquisition than standard architectures, serving as useful objects of
psycholinguistic study. Our code is publicly available.
comment: 15 pages, 5 figures
☆ SOLVE-Med: Specialized Orchestration for Leading Vertical Experts across Medical Specialties
Roberta Di Marino, Giovanni Dioguardi, Antonio Romano, Giuseppe Riccio, Mariano Barone, Marco Postiglione, Flora Amato, Vincenzo Moscato
Medical question answering systems face deployment challenges including
hallucinations, bias, computational demands, privacy concerns, and the need for
specialized expertise across diverse domains. Here, we present SOLVE-Med, a
multi-agent architecture combining domain-specialized small language models for
complex medical queries. The system employs a Router Agent for dynamic
specialist selection, ten specialized models (1B parameters each) fine-tuned on
specific medical domains, and an Orchestrator Agent that synthesizes responses.
Evaluated on Italian medical forum data across ten specialties, SOLVE-Med
achieves superior performance with ROUGE-1 of 0.301 and BERTScore F1 of 0.697,
outperforming standalone models up to 14B parameters while enabling local
deployment. Our code is publicly available on GitHub:
https://github.com/PRAISELab-PicusLab/SOLVE-Med.
☆ One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework
Understanding how well large language models can follow users' instructions
throughout a dialogue spanning multiple topics is of great importance for
data-intensive conversational applications. Existing benchmarks are often
limited to a fixed number of turns, making them susceptible to saturation and
failing to account for the user's interactive experience. In this work, we
propose an extensible framework for assessing multi-turn instruction-following
ability. At its core, our framework decouples linguistic surface forms from
user intent simulation through a three-layer mechanism that tracks constraints,
instructions, and topics. This framework mimics User-LLM interaction by
enabling the dynamic construction of benchmarks with state changes and
tracebacks, terminating a conversation only when the model exhausts a simulated
user's patience. We define a suite of metrics capturing the quality of the
interaction process. Using this framework, we construct EvolIF, an evolving
instruction-following benchmark incorporating nine distinct constraint types.
Our results indicate that GPT-5 exhibits superior instruction-following
performance. It sustains an average of 18.54 conversational turns and
demonstrates 70.31% robustness, outperforming Gemini-2.5-Pro by a significant
margin of 11.41%, while other models lag far behind. All of the data and code
will be made publicly available online.
☆ HaluMem: Evaluating Hallucinations in Memory Systems of Agents
Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng, Bo Tang, Xinchi Li, Feiyu Xiong, Zhiyu Li
Memory systems are key components that enable AI systems such as LLMs and AI
agents to achieve long-term learning and sustained interaction. However, during
memory storage and retrieval, these systems frequently exhibit memory
hallucinations, including fabrication, errors, conflicts, and omissions.
Existing evaluations of memory hallucinations are primarily end-to-end question
answering, which makes it difficult to localize the operational stage within
the memory system where hallucinations arise. To address this, we introduce the
Hallucination in Memory Benchmark (HaluMem), the first operation-level
hallucination evaluation benchmark tailored to memory systems. HaluMem defines
three evaluation tasks (memory extraction, memory updating, and memory question
answering) to comprehensively reveal hallucination behaviors across different
operational stages of interaction. To support evaluation, we construct
user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and
HaluMem-Long. Both include about 15k memory points and 3.5k multi-type
questions. The average dialogue length per user reaches 1.5k and 2.6k turns,
with context lengths exceeding 1M tokens, enabling evaluation of hallucinations
across different context scales and task complexities. Empirical studies based
on HaluMem show that existing memory systems tend to generate and accumulate
hallucinations during the extraction and updating stages, which subsequently
propagate errors to the question answering stage. Future research should focus
on developing interpretable and constrained memory operation mechanisms that
systematically suppress hallucinations and improve memory reliability.
☆ BanglaSTEM: A Parallel Corpus for Technical Domain Bangla-English Translation
Large language models work well for technical problem solving in English but
perform poorly when the same questions are asked in Bangla. A simple solution
would be to translate Bangla questions into English first and then use these
models. However, existing Bangla-English translation systems struggle with
technical terms. They often mistranslate specialized vocabulary, which changes
the meaning of the problem and leads to wrong answers. We present BanglaSTEM, a
dataset of 5,000 carefully selected Bangla-English sentence pairs from STEM
fields including computer science, mathematics, physics, chemistry, and
biology. We generated over 12,000 translations using language models and then
used human evaluators to select the highest quality pairs that preserve
technical terminology correctly. We train a T5-based translation model on
BanglaSTEM and test it on two tasks: generating code and solving math problems.
Our results show significant improvements in translation accuracy for technical
content, making it easier for Bangla speakers to use English-focused language
models effectively. Both the BanglaSTEM dataset and the trained translation
model are publicly released at https://huggingface.co/reyazul/BanglaSTEM-T5.
☆ Kastor: Fine-tuned Small Language Models for Shape-based Active Relation Extraction ESWC 2025
RDF pattern-based extraction is a compelling approach for fine-tuning small
language models (SLMs) by focusing a relation extraction task on a specified
SHACL shape. This technique enables the development of efficient models trained
on limited text and RDF data. In this article, we introduce Kastor, a framework
that advances this approach to meet the demands for completing and refining
knowledge bases in specialized domains. Kastor reformulates the traditional
validation task, shifting from single SHACL shape validation to evaluating all
possible combinations of properties derived from the shape. By selecting the
optimal combination for each training example, the framework significantly
enhances model generalization and performance. Additionally, Kastor employs an
iterative learning process to refine noisy knowledge bases, enabling the
creation of robust models capable of uncovering new, relevant facts.
comment: Accepted at ESWC 2025
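The combination-evaluation step described above can be pictured with a short
sketch that enumerates every subset of properties derived from a SHACL shape;
the property names are hypothetical and the per-combination scoring is omitted.

    from itertools import combinations

    # Hypothetical datatype/object properties derived from a SHACL shape.
    shape_properties = ["ex:name", "ex:birthDate", "ex:affiliation", "ex:award"]

    def property_combinations(props):
        # Enumerate every non-empty subset; a Kastor-style trainer would score
        # each combination against a training example and keep the best one.
        for k in range(1, len(props) + 1):
            yield from combinations(props, k)

    for combo in property_combinations(shape_properties):
        print(combo)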
☆ CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field LREC 2026
Critical appraisal of scientific literature is an essential skill in the
biomedical field. While large language models (LLMs) can offer promising
support in this task, their reliability remains limited, particularly for
critical reasoning in specialized domains. We introduce CareMedEval, an
original dataset designed to evaluate LLMs on biomedical critical appraisal and
reasoning tasks. Derived from authentic exams taken by French medical students,
the dataset contains 534 questions based on 37 scientific articles. Unlike
existing benchmarks, CareMedEval explicitly evaluates critical reading and
reasoning grounded in scientific papers. Benchmarking state-of-the-art
generalist and biomedical-specialized LLMs under various context conditions
reveals the difficulty of the task: open and commercial models fail to exceed
an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens
considerably improves the results. Yet, models remain challenged especially on
questions about study limitations and statistical analysis. CareMedEval
provides a challenging benchmark for grounded reasoning, exposing current LLM
limitations and paving the way for future development of automated support for
critical appraisal.
comment: Preprint submitted to LREC 2026 (under review) To access the dataset,
see https://github.com/bonzid/CareMedEval
☆ Knowledge-Augmented Question Error Correction for Chinese Question Answer System with QuestionRAG EMNLP2025
Input errors in question-answering (QA) systems often lead to incorrect
responses. Large language models (LLMs) struggle with this task, frequently
failing to interpret user intent (misinterpretation) or unnecessarily altering
the original question's structure (over-correction). We propose QuestionRAG, a
framework that tackles these problems. To address misinterpretation, it
enriches the input with external knowledge (e.g., search results, related
entities). To prevent over-correction, it uses reinforcement learning (RL) to
align the model's objective with precise correction, not just paraphrasing. Our
results demonstrate that knowledge augmentation is critical for understanding
faulty questions. Furthermore, RL-based alignment proves significantly more
effective than traditional supervised fine-tuning (SFT), boosting the model's
ability to follow instructions and generalize. By integrating these two
strategies, QuestionRAG unlocks the full potential of LLMs for the question
correction task.
comment: EMNLP2025 Industry Track
☆ Efficient Reasoning via Thought-Training and Thought-Free Inference
Recent advances in large language models (LLMs) have leveraged explicit
Chain-of-Thought (CoT) prompting to improve reasoning accuracy. However, most
existing methods primarily compress verbose reasoning outputs. These
Long-to-Short transformations aim to improve efficiency, but still rely on
explicit reasoning during inference. In this work, we introduce 3TF
(Thought-Training and Thought-Free inference), a framework for efficient
reasoning that takes a Short-to-Long
perspective. We first train a hybrid model that can operate in both reasoning
and non-reasoning modes, and then further train it on CoT-annotated data to
internalize structured reasoning, while enforcing concise, thought-free outputs
at inference time using the no-reasoning mode. Unlike compression-based
approaches, 3TF improves the reasoning quality of non-reasoning outputs,
enabling models to perform rich internal reasoning implicitly while keeping
external outputs short. Empirically, 3TF-trained models obtain large
improvements on reasoning benchmarks under thought-free inference,
demonstrating that high quality reasoning can be learned and executed
implicitly without explicit step-by-step generation.
comment: 11 pages, 4 figures
☆ Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties
Small language models (SLMs) have shown promise for relation extraction (RE)
when extracting RDF triples guided by SHACL shapes focused on common datatype
properties. This paper investigates how SLMs handle both datatype and object
properties for a complete RDF graph extraction. We show that the key bottleneck
is related to long-tail distribution of rare properties. To solve this issue,
we evaluate several strategies: stratified sampling, weighted loss, dataset
scaling, and template-based synthetic data augmentation. We show that the best
strategy to perform equally well over unbalanced target properties is to build
a training set where the number of occurrences of each property exceeds a given
threshold. To enable reproducibility, we publicly released our datasets,
experimental results and code. Our findings offer practical guidance for
training shape-aware SLMs and highlight promising directions for future work in
semantic RE.
comment: Accepted at KCAP 2025
☆ Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance
Existing Machine Translation (MT) research often suggests a single, fixed set
of hyperparameters for word segmentation models, symmetric Byte Pair Encoding
(BPE), which applies the same number of merge operations (NMO) to train
tokenizers for both source and target languages. However, we demonstrate that
this uniform approach does not guarantee optimal MT performance across different
language pairs and data sizes. This work investigates BPE segmentation recipes
across various data volumes and language pairs to evaluate MT system
performance. We find that utilizing asymmetric BPE, where the source and target
languages have different NMOs, significantly improves results over the
symmetric approach, especially in low-resource settings (50K, 100K, and 500K
sentence pairs). Specifically, asymmetric BPE yields statistically significant
($p<0.05$) average gains of 5.32, 4.46, and 0.7 CHRF++ on English-Hindi in
low-resource setups. We validated this trend across six additional language
pairs (English and Telugu, Shona, Norwegian, Kyrgyz, Hausa, and Inuktitut),
observing statistically significant improvement in 10 out of 12 systems
compared to symmetric BPE. Our findings indicate that a high NMO for the source
(4K to 32K) and a low NMO for the target (0.5K to 2K) provide optimal results,
particularly benefiting low-resource MT.
comment: Accepted at WAT 2025
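A minimal sketch of the asymmetric recipe, assuming the HuggingFace
`tokenizers` library and using vocabulary size as a stand-in for the number of
merge operations (NMO); the file paths are placeholders and the paper's exact
training setup may differ.

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    def train_bpe(files, vocab_size):
        # Train a plain BPE tokenizer; vocab_size acts as a proxy for NMO here.
        tok = Tokenizer(BPE(unk_token="[UNK]"))
        tok.pre_tokenizer = Whitespace()
        tok.train(files, BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"]))
        return tok

    # High NMO on the source side, low NMO on the target side, following the
    # ranges suggested above (file paths are placeholders).
    src_tok = train_bpe(["train.en"], vocab_size=32000)
    tgt_tok = train_bpe(["train.hi"], vocab_size=2000)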
☆ Beyond Citations: Measuring Idea-level Knowledge Diffusion from Research to Journalism and Policy-making
Despite the importance of social science knowledge for various stakeholders,
measuring its diffusion into different domains remains a challenge. This study
uses a novel text-based approach to measure the idea-level diffusion of social
science knowledge from the research domain to the journalism and policy-making
domains. By doing so, we expand the detection of knowledge diffusion beyond the
measurements of direct references. Our study focuses on media effects theories
as key research ideas in the field of communication science. Using 72,703
documents (2000-2019) from three domains (i.e., research, journalism, and
policy-making) that mention these ideas, we count the mentions of these ideas
in each domain, estimate their domain-specific contexts, and track and compare
differences across domains and over time. Overall, we find that diffusion
patterns and dynamics vary considerably between ideas, with some ideas
diffusing between other domains, while others do not. Based on the embedding
regression approach, we compare contextualized meanings across domains and find
that the distances between research and policy are typically larger than
between research and journalism. We also find that ideas largely shift roles
across domains - from being the theories themselves in research to sense-making
in news to applied, administrative use in policy. Over time, we observe
semantic convergence mainly for ideas that are practically oriented. Our
results characterize the cross-domain diffusion patterns and dynamics of social
science knowledge at the idea level, and we discuss the implications for
measuring knowledge diffusion beyond citations.
☆ LFC-DA: Logical Formula-Controlled Data Augmentation for Enhanced Logical Reasoning
For complex logical data augmentation, heavy reliance on human annotation is
costly, whereas direct generation with large language models yields
uninterpretable and logically homogeneous examples. To address this, we present
LFC-DA, a symbolic-logic-controlled pipeline: logical text is first mapped to
propositional expressions, a compact rule library is compiled, and a bounded
state-space search systematically discovers valid formulas that are then
verbalized back into natural-language questions, ensuring both diversity and
logical rigor under propositional logic. Experiments on ReClor and LogiQA show
significant improvements in the logical-reasoning accuracy of pretrained
models, confirming the effectiveness of LFC-DA for LLM-guided logical data
augmentation.
comment: 10 pages, 6 figures
☆ EQ-Negotiator: Dynamic Emotional Personas Empower Small Language Models for Edge-Deployable Credit Negotiation
The deployment of large language models (LLMs) in automated negotiation has
set a high performance benchmark, but their computational cost and data privacy
requirements render them unsuitable for many privacy-sensitive, on-device
applications such as mobile assistants, embodied AI agents or private client
interactions. While small language models (SLMs) offer a practical alternative,
they suffer from a significant performance gap compared to LLMs in playing
emotionally charged complex personas, especially for credit negotiation. This
paper introduces EQ-Negotiator, a novel framework that bridges this capability
gap using emotional personas. Its core is a reasoning system that integrates
game theory with a Hidden Markov Model (HMM) to learn and track debtor emotional
states online, without pre-training. This allows EQ-Negotiator to equip SLMs
with the strategic intelligence to counter manipulation while de-escalating
conflict and upholding ethical standards. Through extensive agent-to-agent
simulations across diverse credit negotiation scenarios, including adversarial
debtor strategies like cheating, threatening, and playing the victim, we show
that a 7B parameter language model with EQ-Negotiator achieves better debt
recovery and negotiation efficiency than baseline LLMs more than 10 times its
size. This work advances persona modeling from descriptive character profiles
to dynamic emotional architectures that operate within privacy constraints.
Besides, this paper establishes that strategic emotional intelligence, not raw
model scale, is the critical factor for success in automated negotiation,
paving the way for effective, ethical, and privacy-preserving AI negotiators
that can operate on the edge.
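A minimal sketch of the online HMM belief update that such a framework could
use to track debtor emotional states turn by turn; the emotion labels,
transition matrix, and observation likelihoods are hypothetical, not the
paper's learned parameters.

    import numpy as np

    states = ["calm", "defensive", "aggressive", "distressed"]  # hypothetical
    T = np.array([[0.7, 0.1, 0.1, 0.1],   # hypothetical transition matrix
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.2, 0.6, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])

    def forward_update(belief, obs_likelihood):
        # One online HMM forward step: propagate through the transition model,
        # weight by the observation likelihood, and renormalize.
        predicted = belief @ T
        posterior = predicted * obs_likelihood
        return posterior / posterior.sum()

    belief = np.full(len(states), 0.25)      # uniform prior over emotional states
    obs = np.array([0.1, 0.2, 0.6, 0.1])     # e.g., from an utterance classifier
    belief = forward_update(belief, obs)
    print(dict(zip(states, belief.round(3))))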
☆ Silenced Biases: The Dark Side LLMs Learned to Refuse
Safety-aligned large language models (LLMs) are becoming increasingly
widespread, especially in sensitive applications where fairness is essential
and biased outputs can cause significant harm. However, evaluating the fairness
of models is a complex challenge, and approaches that do so typically utilize
standard question-answer (QA) styled schemes. Such methods often overlook
deeper issues by interpreting the model's refusal responses as positive
fairness measurements, which creates a false sense of fairness. In this work,
we introduce the concept of silenced biases, which are unfair preferences
encoded within models' latent space and are effectively concealed by
safety-alignment. Previous approaches that considered similar indirect biases
often relied on prompt manipulation or handcrafted implicit queries, which
present limited scalability and risk contaminating the evaluation process with
additional biases. We propose the Silenced Bias Benchmark (SBB), which aims to
uncover these biases by employing activation steering to reduce model refusals
during QA. SBB supports easy expansion to new demographic groups and subjects,
presenting a fairness evaluation framework that encourages the future
development of fair models and tools beyond the masking effects of alignment
training. We demonstrate our approach over multiple LLMs, where our findings
expose an alarming distinction between models' direct responses and their
underlying fairness issues.
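Activation steering of the kind described above is commonly implemented with a
forward hook that adds a precomputed steering vector to a chosen layer's hidden
states; the sketch below is a generic PyTorch illustration under that
assumption, with the model, layer index, and vector left as placeholders.

    import torch

    def make_steering_hook(steering_vector, alpha=4.0):
        # Add a scaled steering vector to the layer's hidden states; returning a
        # value from a forward hook replaces the module's output in PyTorch.
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + alpha * steering_vector.to(hidden.device, hidden.dtype)
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        return hook

    # Placeholders: `model`, the layer index, and `steering_vector` (e.g., the
    # difference of mean activations between complying and refusing prompts).
    # handle = model.model.layers[15].register_forward_hook(
    #     make_steering_hook(steering_vector))
    # ... run the QA probes with refusals suppressed, then handle.remove() ...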
☆ Generative Artificial Intelligence in Bioinformatics: A Systematic Review of Models, Applications, and Methodological Advances
Riasad Alvi, Sayeem Been Zaman, Wasimul Karim, Arefin Ittesafun Abian, Mohaimenul Azam Khan Raiaan, Saddam Mukta, Md Rafi Ur Rashid, Md Rafiqul Islam, Yakub Sebastian, Sami Azam
Generative artificial intelligence (GenAI) has become a transformative
approach in bioinformatics that often enables advancements in genomics,
proteomics, transcriptomics, structural biology, and drug discovery. To
systematically identify and evaluate these growing developments, this review
proposed six research questions (RQs), according to the preferred reporting
items for systematic reviews and meta-analysis methods. The objective is to
evaluate impactful GenAI strategies in methodological advancement, predictive
performance, and specialization, and to identify promising approaches for
advanced modeling, data-intensive discovery, and integrative biological
analysis. RQ1 highlights diverse applications across multiple bioinformatics
subfields (sequence analysis, molecular design, and integrative data modeling),
which demonstrate superior performance over traditional methods through pattern
recognition and output generation. RQ2 reveals that adapted specialized model
architectures outperformed general-purpose models, an advantage attributed to
targeted pretraining and context-aware strategies. RQ3 identifies significant
benefits in the bioinformatics domains, focusing on molecular analysis and data
integration, which improves accuracy and reduces errors in complex analysis.
RQ4 indicates improvements in structural modeling, functional prediction, and
synthetic data generation, validated by established benchmarks. RQ5 suggests
the main constraints, such as the lack of scalability and biases in data that
impact generalizability, and proposes future directions focused on robust
evaluation and biologically grounded modeling. RQ6 shows that molecular
datasets (such as UniProtKB and ProteinNet12), cellular datasets (such as
CELLxGENE and GTEx) and textual resources (such as PubMedQA and OMIM) broadly
support the training and generalization of GenAI models.
☆ Benchmarking the Thinking Mode of Multimodal Large Language Models in Clinical Tasks
Jindong Hong, Tianjie Chen, Lingjie Luo, Chuanyang Zheng, Ting Xu, Haibao Yu, Jianing Qiu, Qianzhong Chen, Suning Huang, Yan Xu, Yong Gui, Yijun He, Jiankai Sun
A recent advancement in Multimodal Large Language Models (MLLMs) research is
the emergence of "reasoning MLLMs" that offer explicit control over their
internal thinking processes (normally referred to as the "thinking mode")
alongside the standard "non-thinking mode". This capability allows these models
to engage in a step-by-step process of internal deliberation before generating
a final response. With the rapid transition to and adoption of these
"dual-state" MLLMs, this work rigorously evaluates how the enhanced reasoning
processes of these MLLMs impact model performance and reliability in clinical
tasks. This paper evaluates the active "thinking mode" capabilities of two
leading MLLMs, Seed1.5-VL and Gemini-2.5-Flash, for medical applications. We
assessed their performance on four visual medical tasks using VQA-RAD and
ROCOv2 datasets. Our findings reveal that the improvement from activating the
thinking mode remains marginal compared to the standard non-thinking mode for
the majority of the tasks. Their performance on complex medical tasks such as
open-ended VQA and medical image interpretation remains suboptimal,
highlighting the need for domain-specific medical data and more advanced
methods for medical knowledge integration.
☆ How to Evaluate Speech Translation with Source-Aware Neural MT Metrics
Automatic evaluation of speech-to-text translation (ST) systems is typically
performed by comparing translation hypotheses with one or more reference
translations. While effective to some extent, this approach inherits the
limitation of reference-based evaluation that ignores valuable information from
the source input. In machine translation (MT), recent progress has shown that
neural metrics incorporating the source text achieve stronger correlation with
human judgments. Extending this idea to ST, however, is not trivial because the
source is audio rather than text, and reliable transcripts or alignments
between source and references are often unavailable. In this work, we conduct
the first systematic study of source-aware metrics for ST, with a particular
focus on real-world operating conditions where source transcripts are not
available. We explore two complementary strategies for generating textual
proxies of the input audio: automatic speech recognition (ASR) transcripts and
back-translations of the reference translation. We also introduce a novel two-step
cross-lingual re-segmentation algorithm to address the alignment mismatch
between synthetic sources and reference translations. Our experiments, carried
out on two ST benchmarks covering 79 language pairs and six ST systems with
diverse architectures and performance levels, show that ASR transcripts
constitute a more reliable synthetic source than back-translations when word
error rate is below 20%, while back-translations always represent a
computationally cheaper but still effective alternative. Furthermore, our
cross-lingual re-segmentation algorithm enables robust use of source-aware MT
metrics in ST evaluation, paving the way toward more accurate and principled
evaluation methodologies for speech translation.
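As a concrete illustration of scoring with a source-aware neural MT metric
using an ASR transcript as the textual source proxy, the sketch below uses the
open-source COMET toolkit; the checkpoint name and example data are
assumptions, not the paper's exact configuration.

    from comet import download_model, load_from_checkpoint

    model_path = download_model("Unbabel/wmt22-comet-da")   # assumed checkpoint
    model = load_from_checkpoint(model_path)

    data = [{
        "src": "ASR transcript of the source audio segment",  # source proxy
        "mt":  "system translation hypothesis",
        "ref": "reference translation",
    }]
    scores = model.predict(data, batch_size=8, gpus=0)
    print(scores.system_score)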
☆ Let the Bees Find the Weak Spots: A Path Planning Perspective on Multi-Turn Jailbreak Attacks against LLMs
Large Language Models (LLMs) have been widely deployed across various
applications, yet their potential security and ethical risks have raised
increasing concerns. Existing research employs red teaming evaluations,
utilizing multi-turn jailbreaks to identify potential vulnerabilities in LLMs.
However, these approaches often lack exploration of successful dialogue
trajectories within the attack space, and they tend to overlook the
considerable overhead associated with the attack process. To address these
limitations, this paper first introduces a theoretical model based on
dynamically weighted graph topology, abstracting the multi-turn attack process
as a path planning problem. Based on this framework, we propose ABC, an
enhanced Artificial Bee Colony algorithm for multi-turn jailbreaks, featuring a
collaborative search mechanism with employed, onlooker, and scout bees. This
algorithm significantly improves the efficiency of optimal attack path search
while substantially reducing the average number of queries required. Empirical
evaluations on three open-source and two proprietary language models
demonstrate the effectiveness of our approach, achieving attack success rates
above 90% across the board, with a peak of 98% on GPT-3.5-Turbo, and
outperforming existing baselines. Furthermore, it achieves comparable success
with only 26 queries on average, significantly reducing red teaming overhead
and highlighting its superior efficiency.
☆ SCALE: Upscaled Continual Learning of Large Language Models
Jin-woo Lee, Junhwa Choi, Bongkyu Hwang, Jinho Choo, Bogun Kim, JeongSeon Yi, Joonseok Lee, DongYoung Jung, Jaeseon Park, Kyoungwon Park, Suk-hoon Jung
We revisit continual pre-training for large language models and argue that
progress now depends more on scaling the right structure than on scaling
parameters alone. We introduce SCALE, a width upscaling architecture that
inserts lightweight expansion into linear modules while freezing all
pre-trained parameters. This preserves the residual and attention topologies
and increases capacity without perturbing the base model's original
functionality. SCALE is guided by two principles: Persistent Preservation,
which maintains the base model's behavior via preservation-oriented
initialization and freezing of the pre-trained weights, and Collaborative
Adaptation, which selectively trains a subset of expansion components to
acquire new knowledge with minimal interference. We instantiate these ideas as
SCALE-Preserve (preservation-first), SCALE-Adapt (adaptation-first), and
SCALE-Route, an optional routing extension that performs token-level routing
between preservation and adaptation heads. On a controlled synthetic biography
benchmark, SCALE mitigates the severe forgetting observed with depth expansion
while still acquiring new knowledge. In continual pre-training on a Korean
corpus, SCALE variants achieve less forgetting on English evaluations and
competitive gains on Korean benchmarks, with these variants offering the best
overall stability-plasticity trade-off. Accompanying analysis clarifies when
preservation provably holds and why the interplay between preservation and
adaptation stabilizes optimization compared to standard continual learning
setups.
☆ Comparing the Performance of LLMs in RAG-based Question-Answering: A Case Study in Computer Science Literature
Retrieval Augmented Generation (RAG) is emerging as a powerful technique to
enhance the capabilities of Generative AI models by reducing hallucination.
Thus, the increasing prominence of RAG alongside Large Language Models (LLMs)
has sparked interest in comparing the performance of different LLMs in
question-answering (QA) in diverse domains. This study compares the performance
of four open-source LLMs, Mistral-7b-instruct, LLaMa2-7b-chat,
Falcon-7b-instruct and Orca-mini-v3-7b, and OpenAI's trending GPT-3.5 over QA
tasks within the computer science literature leveraging RAG support. Evaluation
metrics employed in the study include accuracy and precision for binary
questions, and, for long-answer questions, ranking by a human expert, ranking
by Google's AI model Gemini, and cosine similarity. GPT-3.5, when paired
with RAG, effectively answers binary and long-answer questions, reaffirming its
status as an advanced LLM. Regarding open-source LLMs, Mistral AI's
Mistral-7b-instruct paired with RAG surpasses the rest in answering both binary
and long-answer questions. However, among the open-source LLMs, Orca-mini-v3-7b
reports the shortest average latency in generating responses, whereas
LLaMa2-7b-chat by Meta reports the highest average latency. This research
underscores the fact that open-source LLMs, too, can go hand in hand with
proprietary models like GPT-3.5 with better infrastructure.
comment: 18 pages, 4 figures, 5 tables, presented at the 5th International
Conference on Artificial Intelligence in Education Technology
☆ IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs
Tokenizers play a crucial role in determining the performance, training
efficiency, and inference cost of Large Language Models (LLMs). Designing
effective tokenizers for multilingual LLMs is particularly challenging due to
diverse scripts and rich morphological variation. While subword methods such as
Byte Pair Encoding (BPE) are widely adopted, their effectiveness in
multilingual settings remains underexplored. We present IndicSuperTokenizer, a
tokenizer for Indic multilingual LLMs, that combines both subword and
multi-word tokenization, along with language-specific pre-tokenization, leading
to more linguistically aligned tokens and achieving a new state-of-the-art in
fertility score. Evaluated across English, 22 Indian languages and code data,
our tokenizer improves the average fertility score by 39.5% over LLaMA4 and by
18% over Sutra (the current best). This translates to 44% improvement in
inference throughput over LLaMA4 while maintaining comparable performance on
English and Indic benchmarks. We also present detailed ablations across
tokenizer training data size, vocabulary size, merging techniques, and
pre-tokenization strategies, demonstrating the robustness of our design
choices.
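Fertility is conventionally measured as the average number of subword tokens
per whitespace-separated word, which is presumably the quantity behind the
reported gains; the sketch below computes it for an off-the-shelf multilingual
tokenizer, used only as an example and not IndicSuperTokenizer itself.

    from transformers import AutoTokenizer

    def fertility(tokenizer, sentences):
        # Average subword tokens per whitespace word; lower is better.
        tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
        words = sum(len(s.split()) for s in sentences)
        return tokens / words

    # Any public multilingual tokenizer works as an example here.
    tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
    print(fertility(tok, ["यह एक उदाहरण वाक्य है।", "This is an example sentence."]))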
☆ Beyond Ranked Lists: The SARAL Framework for Cross-Lingual Document Set Retrieval
Machine Translation for English Retrieval of Information in Any Language
(MATERIAL) is an IARPA initiative targeted to advance the state of
cross-lingual information retrieval (CLIR). This report provides a detailed
description of Information Sciences Institute's (ISI's) Summarization and
domain-Adaptive Retrieval Across Language's (SARAL's) effort for MATERIAL.
Specifically, we outline our team's novel approach to CLIR, with emphasis on
developing an approach amenable to retrieving a query-relevant document set,
rather than just a ranked document list. In MATERIAL's Phase-3
evaluations, SARAL exceeded the performance of other teams in five out of six
evaluation conditions spanning three different languages (Farsi, Kazakh, and
Georgian).
☆ Hybrid Fact-Checking that Integrates Knowledge Graphs, Large Language Models, and Search-Based Retrieval Agents Improves Interpretable Claim Verification EMNLP
Large language models (LLMs) excel in generating fluent utterances but can
lack reliable grounding in verified information. At the same time,
knowledge-graph-based fact-checkers deliver precise and interpretable evidence,
yet suffer from limited coverage or latency. By integrating LLMs with knowledge
graphs and real-time search agents, we introduce a hybrid fact-checking
approach that leverages the individual strengths of each component. Our system
comprises three autonomous steps: 1) a Knowledge Graph (KG) Retrieval for rapid
one-hop lookups in DBpedia, 2) an LM-based classification guided by a
task-specific labeling prompt, producing outputs with internal rule-based
logic, and 3) a Web Search Agent invoked only when KG coverage is insufficient.
Our pipeline achieves an F1 score of 0.93 on the FEVER benchmark on the
Supported/Refuted split without task-specific fine-tuning. To address "Not
Enough Information" cases, we conduct a targeted reannotation study showing that
our approach frequently uncovers valid evidence for claims originally labeled
as Not Enough Information (NEI), as confirmed by both expert annotators and LLM
reviewers. With this paper, we present a modular, open-source fact-checking
pipeline with fallback strategies and generalization across datasets.
comment: Paper has been accepted at 9th wiNLP workshop at EMNLP
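The one-hop KG Retrieval step can be illustrated with a short SPARQL lookup
against the public DBpedia endpoint using SPARQLWrapper; the entity and query
are illustrative, not the paper's pipeline.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        SELECT ?p ?o WHERE {
            <http://dbpedia.org/resource/Albert_Einstein> ?p ?o .
        } LIMIT 50
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    # One-hop facts a downstream classifier can match against the claim; if
    # nothing relevant is returned, fall back to the web search agent.
    for row in results["results"]["bindings"]:
        print(row["p"]["value"], row["o"]["value"])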
☆ LGM: Enhancing Large Language Models with Conceptual Meta-Relations and Iterative Retrieval
Large language models (LLMs) exhibit strong semantic understanding, yet
struggle when user instructions involve ambiguous or conceptually misaligned
terms. We propose the Language Graph Model (LGM) to enhance conceptual clarity
by extracting meta-relations (inheritance, alias, and composition) from natural
language. The model further employs a reflection mechanism to validate these
meta-relations. Leveraging a Concept Iterative Retrieval Algorithm, these
relations and related descriptions are dynamically supplied to the LLM,
improving its ability to interpret concepts and generate accurate responses.
Unlike conventional Retrieval-Augmented Generation (RAG) approaches that rely
on extended context windows, our method enables large language models to
process texts of any length without the need for truncation. Experiments on
standard benchmarks demonstrate that the LGM consistently outperforms existing
RAG baselines.
comment: 30 pages, 5 figures
☆ BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture
As multilingual Large Language Models (LLMs) gain traction across South Asia,
their alignment with local ethical norms, particularly for Bengali, which is
spoken by over 285 million people and ranked 6th globally, remains
underexplored. Existing ethics benchmarks are largely English-centric and
shaped by Western frameworks, overlooking cultural nuances critical for
real-world deployment. To address this, we introduce BengaliMoralBench, the
first large-scale ethics benchmark for the Bengali language and socio-cultural
contexts. It covers five moral domains, Daily Activities, Habits, Parenting,
Family Relationships, and Religious Activities, subdivided into 50 culturally
relevant subtopics. Each scenario is annotated via native-speaker consensus
using three ethical lenses: Virtue, Commonsense, and Justice ethics. We conduct
systematic zero-shot evaluation of prominent multilingual LLMs, including
Llama, Gemma, Qwen, and DeepSeek, using a unified prompting protocol and
standard metrics. Performance varies widely (50-91% accuracy), with qualitative
analysis revealing consistent weaknesses in cultural grounding, commonsense
reasoning, and moral fairness. BengaliMoralBench provides a foundation for
responsible localization, enabling culturally aligned evaluation and supporting
the deployment of ethically robust AI in diverse, low-resource multilingual
settings such as Bangladesh.
comment: This manuscript is a preprint currently under review
☆ Measuring Aleatoric and Epistemic Uncertainty in LLMs: Empirical Evaluation on ID and OOD QA Tasks KDD'24
Large Language Models (LLMs) have become increasingly pervasive, finding
applications across many industries and disciplines. Ensuring the
trustworthiness of LLM outputs is paramount, where Uncertainty Estimation (UE)
plays a key role. In this work, a comprehensive empirical study is conducted to
examine the robustness and effectiveness of diverse UE measures regarding
aleatoric and epistemic uncertainty in LLMs. It involves twelve different UE
methods and four generation quality metrics including LLMScore from LLM
criticizers to evaluate the uncertainty of LLM-generated answers in
Question-Answering (QA) tasks on both in-distribution (ID) and
out-of-distribution (OOD) datasets. Our analysis reveals that information-based
methods, which leverage token and sequence probabilities, perform exceptionally
well in ID settings due to their alignment with the model's understanding of
the data. Conversely, density-based methods and the P(True) metric exhibit
superior performance in OOD contexts, highlighting their effectiveness in
capturing the model's epistemic uncertainty. Semantic consistency methods,
which assess variability in generated answers, show reliable performance across
different datasets and generation metrics. These methods generally perform well
but may not be optimal for every situation.
comment: Accepted by UDM-KDD'24
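As a simple illustration of the information-based family of UE measures
discussed above, the sketch below computes the mean negative log-likelihood of
generated tokens from their probabilities; it is a generic example, not any
specific method evaluated in the study.

    import math

    def mean_token_nll(token_probs):
        # Average negative log-likelihood of the generated tokens; higher values
        # indicate greater uncertainty about the generated answer.
        return -sum(math.log(p) for p in token_probs) / len(token_probs)

    print(mean_token_nll([0.9, 0.85, 0.95]))  # confident generation, low score
    print(mean_token_nll([0.4, 0.30, 0.50]))  # hesitant generation, higher score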
☆ Who Sees the Risk? Stakeholder Conflicts and Explanatory Policies in LLM-based Risk Assessment
Understanding how different stakeholders perceive risks in AI systems is
essential for their responsible deployment. This paper presents a framework for
stakeholder-grounded risk assessment by using LLMs, acting as judges to predict
and explain risks. Using the Risk Atlas Nexus and GloVE explanation method, our
framework generates stakeholder-specific, interpretable policies that show how
different stakeholders agree or disagree about the same risks. We demonstrate
our method using three real-world AI use cases of medical AI, autonomous
vehicles, and fraud detection domain. We further propose an interactive
visualization that reveals how and why conflicts emerge across stakeholder
perspectives, enhancing transparency in conflict reasoning. Our results show
that stakeholder perspectives significantly influence risk perception and
conflict patterns. Our work emphasizes the importance of these
stakeholder-aware explanations needed to make LLM-based evaluations more
transparent, interpretable, and aligned with human-centered AI governance
goals.
☆ MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity
Kaiyuan Zhang, Chenghao Yang, Zhoufutu Wen, Sihang Yuan, Qiuyue Wang, Chaoyi Huang, Guosheng Zhu, He Wang, Huawenyu Lu, Jianing Wen, Jianpeng Jiao, Lishu Luo, Longxiang Liu, Sijin Wu, Xiaolei Zhu, Xuanliang Zhang, Ge Zhang, Yi Lin, Guang Shi, Chaoyou Fu, Wenhao Huang
As reasoning models scale rapidly, the essential role of multimodality in
human cognition has come into sharp relief, driving a growing need to probe
vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either
overemphasize textual reasoning or fall short of systematically capturing
vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs
insufficiently assessed. To address this limitation, we introduce MME-CC
(Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded
benchmark that organizes 11 representative reasoning tasks into three
fundamental categories of visual information: spatial, geometric, and
knowledge-based reasoning, and provides fine-grained analyses of MLLMs'
cognitive capacity across these dimensions. Based on MME-CC, we conduct
extensive experiments over 16 representative MLLMs. Our study reveals that
closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs.
30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak
(less than or equal to 30%). We further identify common error patterns,
including orientation mistakes, fragile cross-view identity persistence, and
poor adherence to counterfactual instructions, and observe that
Chain-of-Thought typically follows a three-stage process (extract -> reason ->
verify) with heavy reliance on visual extraction. We hope this work catalyzes a
shift toward treating the cognitive capacity of MLLMs as central to both
evaluation and model design.
☆ From Measurement to Expertise: Empathetic Expert Adapters for Context-Based Empathy in Conversational AI Agents
Empathy is a critical factor in fostering positive user experiences in
conversational AI. While models can display empathy, it is often generic rather
than tailored to specific tasks and contexts. In this work, we introduce a
novel framework for developing and evaluating context-specific empathetic large
language models (LLMs). We first analyze a real-world conversational dataset
consisting of 672 multi-turn conversations across 8 tasks, revealing
significant differences in terms of expected and experienced empathy before and
after the conversations, respectively. To help minimize this gap, we develop a
synthetic multi-turn conversational generation pipeline and steer responses
toward our defined empathy patterns based on the context that more closely
matches users' expectations. We then train empathetic expert adapters for
context-specific empathy that specialize in varying empathy levels based on the
recognized task. Our empirical results demonstrate a significant gap reduction
of 72.66% between perceived and desired empathy with scores increasing by an
average factor of 2.43 as measured by our metrics and reward models.
Additionally, our trained empathetic expert adapters demonstrate superior
effectiveness in preserving empathy patterns throughout conversation turns,
outperforming system prompts, which tend to dramatically diminish in impact as
conversations lengthen.
☆ From Insight to Exploit: Leveraging LLM Collaboration for Adaptive Adversarial Text Generation EMNLP 2025
LLMs can provide substantial zero-shot performance on diverse tasks using a
simple task prompt, eliminating the need for training or fine-tuning. However,
when applying these models to sensitive tasks, it is crucial to thoroughly
assess their robustness against adversarial inputs. In this work, we introduce
Static Deceptor (StaDec) and Dynamic Deceptor (DyDec), two innovative attack
frameworks designed to systematically generate dynamic and adaptive adversarial
examples by leveraging the understanding of the LLMs. We produce subtle and
natural-looking adversarial inputs that preserve semantic similarity to the
original text while effectively deceiving the target LLM. By utilizing an
automated, LLM-driven pipeline, we eliminate the dependence on external
heuristics. Our attacks evolve with the advancements in LLMs and demonstrate
strong transferability across models unknown to the attacker. Overall, this
work provides a systematic approach for the self-assessment of an LLM's
robustness. We release our code and data at
https://github.com/Shukti042/AdversarialExample.
comment: Findings of the Association for Computational Linguistics: EMNLP 2025
(camera-ready)
☆ Control Barrier Function for Aligning Large Language Models
This paper proposes a control-based framework for aligning large language
models (LLMs) by leveraging a control barrier function (CBF) to ensure
user-desirable text generation. The presented framework applies the CBF safety
filter to the predicted token generated from the baseline LLM, to intervene in
the generated text. The safety filter includes two significant advantages: this
safety filter is an add-on type, allowing it to be used for alignment purposes
without fine-tuning the baseline LLM, and if there is an evaluation model
regarding the desired alignment, it can be directly applied to the filter
design. The overall text-generation system is implemented with open-source
language models, aiming to generate positive text.
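As a concrete illustration of this add-on filtering idea, the following sketch
applies a barrier-style check to candidate next tokens before sampling. The
desirability function and threshold are hypothetical stand-ins for the paper's
CBF formulation and evaluation model.
    # Hypothetical sketch of an add-on safety filter over next-token candidates.
    import math, random

    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        total = sum(exps)
        return [e / total for e in exps]

    def cbf_filtered_sample(logits, candidates, desirability, threshold=0.0):
        """Keep only tokens whose desirability score stays above the barrier threshold."""
        allowed = [i for i, tok in enumerate(candidates) if desirability(tok) >= threshold]
        if not allowed:  # fall back to the single most desirable token
            allowed = [max(range(len(candidates)), key=lambda i: desirability(candidates[i]))]
        probs = softmax([logits[i] for i in allowed])
        return candidates[random.choices(allowed, weights=probs)[0]]

    # Toy usage: steer generation toward positive continuations.
    score = {"great": 1.0, "fine": 0.5, "awful": -1.0}
    print(cbf_filtered_sample([2.0, 1.0, 3.0], ["great", "fine", "awful"],
                              lambda tok: score.get(tok, 0.0)))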
☆ CARMA: Comprehensive Automatically-annotated Reddit Mental Health Dataset for Arabic
Mental health disorders affect millions worldwide, yet early detection
remains a major challenge, particularly for Arabic-speaking populations where
resources are limited and mental health discourse is often discouraged due to
cultural stigma. While substantial research has focused on English-language
mental health detection, Arabic remains significantly underexplored, partly due
to the scarcity of annotated datasets. We present CARMA, the first
automatically annotated large-scale dataset of Arabic Reddit posts. The dataset
encompasses six mental health conditions, including Anxiety, Autism, and
Depression, and a control group. CARMA surpasses existing resources in both
scale and diversity. We conduct qualitative and quantitative analyses of
lexical and semantic differences between users, providing insights into the
linguistic markers of specific mental health conditions. To demonstrate the
dataset's potential for further mental health analysis, we perform
classification experiments using a range of models, from shallow classifiers to
large language models. Our results highlight the promise of advancing mental
health detection in underrepresented languages such as Arabic.
☆ A Computational Approach to Analyzing Disrupted Language in Schizophrenia: Integrating Surprisal and Coherence Measures ICASSP 2026
Language disruptions are one of the well-known effects of schizophrenia
symptoms. They are often manifested as disorganized speech and impaired
discourse coherence. These abnormalities in spontaneous language production
reflect underlying cognitive disturbances and have the potential to serve as
objective markers for symptom severity and diagnosis of schizophrenia. This
study focuses on how these language disruptions can be characterized in terms
of two computational linguistic measures: surprisal and semantic coherence. By
computing surprisal and semantic coherence of language using computational
models, this study investigates how they differ between subjects with
schizophrenia and healthy controls. Furthermore, this study provides further
insight into how language disruptions in terms of these linguistic measures
change with varying degrees of schizophrenia symptom severity.
comment: Submitted to ICASSP 2026
☆ PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech EMNLP 2025
Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS)
systems, converting written forms into their canonical spoken equivalents.
Traditional TN systems can exhibit high accuracy, but involve substantial
engineering effort, are difficult to scale, and pose challenges to language
coverage, particularly in low-resource settings. We propose PolyNorm, a
prompt-based approach to TN using Large Language Models (LLMs), aiming to
reduce the reliance on manually crafted rules and enable broader linguistic
applicability with minimal human intervention. Additionally, we present a
language-agnostic pipeline for automatic data curation and evaluation, designed
to facilitate scalable experimentation across diverse languages. Experiments
across eight languages show consistent reductions in the word error rate (WER)
compared to a production-grade system. To support further research, we
release PolyNorm-Benchmark, a multilingual data set covering a diverse range of
text normalization phenomena.
comment: 9 pages including appendix. EMNLP 2025 Industry Track
♻ ☆ GDS Agent for Graph Algorithmic Reasoning
Large language models (LLMs) have shown remarkable multimodal information
processing and reasoning ability. When equipped with tools through function
calling and enhanced with retrieval-augmented techniques, compound LLM-based
systems can access closed data sources and answer questions about them.
However, they still struggle to process and reason over large-scale
graph-structured data. We introduce the GDS (Graph Data Science) agent in this
technical report. The GDS agent introduces a comprehensive set of graph
algorithms as tools, together with preprocessing (retrieval) and postprocessing
of algorithm results, in a model context protocol (MCP) server. The server can
be used with any modern LLM out-of-the-box. The GDS agent allows users to ask any
question that implicitly and intrinsically requires graph algorithmic reasoning
about their data, and quickly obtain accurate and grounded answers. We
introduce new benchmarks that evaluate intermediate tool calls as well as final
responses. The results indicate that the GDS agent is able to solve a wide spectrum
of graph tasks. We also provide detailed case studies for more open-ended tasks
and study scenarios where the agent struggles. Finally, we discuss the
remaining challenges and the future roadmap.
comment: Technical report
♻ ☆ Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages? AACL 2025
Named Entity Recognition (NER) for low-resource languages aims to produce
robust systems for languages where there is limited labeled training data
available, and has been an area of increasing interest within NLP. Data
augmentation for increasing the amount of low-resource labeled data is a common
practice. In this paper, we explore the role of synthetic data in the context
of multilingual, low-resource NER, considering 11 languages from diverse
language families. Our results suggest that synthetic data does in fact hold
promise for low-resource language NER, though we see significant variation
between languages.
comment: Accepted at AACL 2025. Camera-ready version
♻ ☆ Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation
Modern LLMs can now produce highly readable abstractive summaries, to the
point that traditional automated metrics for evaluating summary quality, such
as ROUGE, have saturated. However, LLMs still sometimes introduce inaccuracies
into summaries, i.e., information inconsistent with or unsupported by the
corresponding source. Measuring the occurrence of these often subtle factual
inconsistencies automatically has proved challenging. This in turn has
motivated development of metrics intended to measure the factual consistency of
generated summaries against sources. But are these approaches measuring what
they purport to? Or are they mostly exploiting artifacts? In this work, we
stress test a range of automatic factuality metrics, including specialized
models and LLM-based prompting methods, to probe what they actually capture.
Using a shallow classifier to separate "easy" factual-evaluation examples, where
surface features suffice, from "hard" cases requiring deeper reasoning,
we find that all metrics show substantial performance drops on the latter.
Furthermore, some metrics are more sensitive to benign, fact-preserving edits
than to factual corrections. Building on this observation, we demonstrate that
most automatic factuality metrics can be gamed, i.e., their scores can be
artificially inflated by appending innocuous, content-free sentences to
summaries. Among the metrics tested, the prompt-based ChatGPT-DA approach is
the most robust and reliable. However, this comes with a notable caveat:
Prompting LLMs to assess factuality may overly rely on their parametric
knowledge rather than the provided reference when making judgments. Taken
together, our findings call into question the reliability of current factuality
metrics and prompt a broader reflection on what these metrics are truly
measuring.
♻ ☆ Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs NeurIPS 2025
Despite the impressive generative abilities of black-box large language
models (LLMs), their inherent opacity hinders further advancements in
capabilities such as reasoning, planning, and personalization. Existing works
aim to enhance LLM capabilities via domain-specific adaptation, which require
additional training on accessible model parameters, an infeasible option for
black-box LLMs. To address this challenge, we introduce Matryoshka Pilot
(M-Pilot), a lightweight white-box LLM controller that guides a large-scale
black-box LLM generator by decomposing complex tasks into a series of
intermediate outputs. Specifically, we consider the black-box LLM as an
environment, with M-Pilot serving as a policy to provide intermediate guidance
through prompts for driving the black-box LLM. M-Pilot is trained to pivot the
outputs of the black-box LLM toward alignment with preferences during iterative
interaction, which enables controllable multi-turn generation and
self-improvement in optimizing intermediate guidance. Empirical evaluations on
diverse tasks demonstrate that our method effectively enhances the capabilities
of black-box LLMs in complex, long-horizon tasks. Our code is publicly
available at: https://github.com/lichangh20/Matryoshka.
comment: Accepted by NeurIPS 2025
♻ ☆ Post Persona Alignment for Multi-Session Dialogue Generation EMNLP 2025
Multi-session persona-based dialogue generation presents challenges in
maintaining long-term consistency and generating diverse, personalized
responses. While large language models (LLMs) excel in single-session
dialogues, they struggle to preserve persona fidelity and conversational
coherence across extended interactions. Existing methods typically retrieve
persona information before response generation, which can constrain diversity
and result in generic outputs. We propose Post Persona Alignment (PPA), a novel
two-stage framework that reverses this process. PPA first generates a general
response based solely on dialogue context, then retrieves relevant persona
memories using the response as a query, and finally refines the response to
align with the speaker's persona. This post-hoc alignment strategy promotes
naturalness and diversity while preserving consistency and personalization.
Experiments on multi-session LLM-generated dialogue data demonstrate that PPA
significantly outperforms prior approaches in consistency, diversity, and
persona relevance, offering a more flexible and effective paradigm for
long-term personalized dialogue generation.
comment: EMNLP 2025 Findings
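The flow described above (draft from context only, retrieve with the response as
query, then refine) can be summarized in a short pipeline sketch. The generate
callable and the lexical retriever below are placeholders for an LLM call and a
real persona-memory retriever, so this is an illustrative reading of PPA rather
than the authors' code.
    # Schematic PPA flow: context-only draft -> response-as-query retrieval -> refinement.
    def retrieve(query, persona_memories, k=2):
        """Naive lexical retrieval over persona memories (stand-in for a real retriever)."""
        overlap = lambda m: len(set(query.lower().split()) & set(m.lower().split()))
        return sorted(persona_memories, key=overlap, reverse=True)[:k]

    def post_persona_alignment(dialogue_context, persona_memories, generate):
        draft = generate(f"Reply to the dialogue:\n{dialogue_context}")  # stage 1: draft
        memories = retrieve(draft, persona_memories)                     # stage 2: retrieve
        return generate("Rewrite the reply so it is consistent with these persona "
                        f"facts: {memories}\nReply: {draft}")            # stage 3: refine

    toy_generate = lambda prompt: "Sure, I love hiking on weekends."     # stand-in LLM
    print(post_persona_alignment("User: any weekend plans?",
                                 ["Enjoys hiking", "Has two cats"], toy_generate))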
♻ ☆ Read Your Own Mind: Reasoning Helps Surface Self-Confidence Signals in LLMs EMNLP 2025
We study the source of uncertainty in DeepSeek R1-32B by analyzing its
self-reported verbal confidence on question answering (QA) tasks. In the
default answer-then-confidence setting, the model is regularly over-confident,
whereas semantic entropy - obtained by sampling many responses - remains
reliable. We hypothesize that this is because of semantic entropy's larger
test-time compute, which lets us explore the model's predictive distribution.
We show that granting DeepSeek the budget to explore its distribution by
forcing a long chain-of-thought before the final answer greatly improves its
verbal score effectiveness, even on simple fact-retrieval questions that
normally require no reasoning. Furthermore, a separate reader model that sees
only the chain can reconstruct very similar confidences, indicating the verbal
score might be merely a statistic of the alternatives surfaced during
reasoning. Our analysis concludes that reliable uncertainty estimation requires
explicit exploration of the generative space, and self-reported confidence is
trustworthy only after such exploration.
comment: Presented at UncertaiNLP Workshop at EMNLP 2025
https://aclanthology.org/2025.uncertainlp-main.21.pdf
♻ ☆ R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
Large Language Models (LLMs) achieve impressive reasoning capabilities at the
cost of substantial inference overhead, posing substantial deployment
challenges. Although distilled Small Language Models (SLMs) significantly
enhance efficiency, their performance suffers as they fail to follow LLMs'
reasoning paths. Luckily, we reveal that only a small fraction of tokens
genuinely diverge reasoning paths between LLMs and SLMs. Most generated tokens
are either identical or exhibit neutral differences, such as minor variations
in abbreviations or expressions. Leveraging this insight, we introduce Roads
to Rome (R2R), a neural token routing method that selectively utilizes LLMs
only for these critical, path-divergent tokens, while leaving the majority of
token generation to the SLM. We also develop an automatic data generation
pipeline that identifies divergent tokens and generates token-level routing
labels to train the lightweight router. We apply R2R to combine R1-1.5B and
R1-32B models from the DeepSeek family, and evaluate on challenging math,
coding, and QA benchmarks. With an average activated parameter size of 5.6B,
R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the
R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with
comparable performance, advancing the Pareto frontier of test-time scaling
efficiency. Our code is available at https://github.com/thu-nics/R2R.
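The routing idea can be pictured with a short sketch: the SLM proposes every
token, and a lightweight router decides, per position, whether to override the
proposal with the LLM. The callables below are placeholders, not the released
R2R implementation.
    # Illustrative token-level routing loop in the spirit of R2R.
    def route_generate(prompt, slm_next, llm_next, router_divergent,
                       max_tokens=128, eos="</s>"):
        tokens = []
        for _ in range(max_tokens):
            context = prompt + "".join(tokens)
            candidate = slm_next(context)             # cheap proposal from the small model
            if router_divergent(context, candidate):  # router flags a path-divergent position
                candidate = llm_next(context)         # query the large model for this token only
            tokens.append(candidate)
            if candidate == eos:
                break
        return "".join(tokens)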
♻ ☆ TABLET: A Large-Scale Dataset for Robust Visual Table Understanding
While table understanding increasingly relies on pixel-only settings where
tables are processed as visual representations, current benchmarks
predominantly use synthetic renderings that lack the complexity and visual
diversity of real-world tables. Additionally, existing visual table
understanding (VTU) datasets offer fixed examples with single visualizations
and pre-defined instructions, providing no access to underlying serialized data
for reformulation. We introduce TABLET, a large-scale VTU dataset with 4
million examples across 20 tasks, grounded in 2 million unique tables where 88%
preserve original visualizations. Each example includes paired image-HTML
representations, comprehensive metadata, and provenance information linking
back to the source datasets. Fine-tuning vision-language models like
Qwen2.5-VL-7B on TABLET improves performance on seen and unseen VTU tasks while
increasing robustness on real-world table visualizations. By preserving
original visualizations and maintaining example traceability in a unified
large-scale collection, TABLET establishes a foundation for robust training and
extensible evaluation of future VTU models.
♻ ☆ Token Perturbation Guidance for Diffusion Models NeurIPS 2025
Classifier-free guidance (CFG) has become an essential component of modern
diffusion models to enhance both generation quality and alignment with input
conditions. However, CFG requires specific training procedures and is limited
to conditional generation. To address these limitations, we propose Token
Perturbation Guidance (TPG), a novel method that applies perturbation matrices
directly to intermediate token representations within the diffusion network.
TPG employs a norm-preserving shuffling operation to provide effective and
stable guidance signals that improve generation quality without architectural
changes. As a result, TPG is training-free and agnostic to input conditions,
making it readily applicable to both conditional and unconditional generation.
We further analyze the guidance term provided by TPG and show that its effect
on sampling more closely resembles CFG compared to existing training-free
guidance techniques. Extensive experiments on SDXL and Stable Diffusion 2.1
show that TPG achieves nearly a 2$\times$ improvement in FID for unconditional
generation over the SDXL baseline, while closely matching CFG in prompt
alignment. These results establish TPG as a general, condition-agnostic
guidance method that brings CFG-like benefits to a broader class of diffusion
models.
comment: Accepted at NeurIPS 2025. Project page:
https://github.com/TaatiTeam/Token-Perturbation-Guidance
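A minimal sketch of the perturbation itself and a CFG-style combination of the
clean and perturbed predictions is given below. The shapes and the guidance
formula are illustrative assumptions; TPG's exact placement inside the diffusion
network is described in the paper.
    # Norm-preserving token shuffle plus a CFG-like guidance update (illustrative).
    import numpy as np

    def norm_preserving_shuffle(tokens, rng):
        """Permute token vectors along the sequence axis; per-token norms are unchanged."""
        return tokens[rng.permutation(tokens.shape[0])]

    def guided_prediction(pred_clean, pred_perturbed, guidance_scale=3.0):
        """Push the prediction away from the perturbed branch, analogous to CFG."""
        return pred_perturbed + guidance_scale * (pred_clean - pred_perturbed)

    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(77, 768))               # intermediate token representations
    shuffled = norm_preserving_shuffle(tokens, rng)
    assert np.allclose(np.sort(np.linalg.norm(tokens, axis=1)),
                       np.sort(np.linalg.norm(shuffled, axis=1)))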
♻ ☆ Dense SAE Latents Are Features, Not Bugs NeurIPS 2025
Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, Max Tegmark
Sparse autoencoders (SAEs) are designed to extract interpretable features
from language models by enforcing a sparsity constraint. Ideally, training an
SAE would yield latents that are both sparse and semantically meaningful.
However, many SAE latents activate frequently (i.e., are dense), raising
concerns that they may be undesirable artifacts of the training procedure. In
this work, we systematically investigate the geometry, function, and origin of
dense latents and show that they are not only persistent but often reflect
meaningful model representations. We first demonstrate that dense latents tend
to form antipodal pairs that reconstruct specific directions in the residual
stream, and that ablating their subspace suppresses the emergence of new dense
features in retrained SAEs -- suggesting that high-density features are an
intrinsic property of the residual space. We then introduce a taxonomy of dense
latents, identifying classes tied to position tracking, context binding,
entropy regulation, letter-specific output signals, part-of-speech, and
principal component reconstruction. Finally, we analyze how these features
evolve across layers, revealing a shift from structural features in early
layers, to semantic features in mid layers, and finally to output-oriented
signals in the last layers of the model. Our findings indicate that dense
latents serve functional roles in language model computation and should not be
dismissed as training noise.
comment: NeurIPS 2025 poster
♻ ☆ Assessing the Macro and Micro Effects of Random Seeds on Fine-Tuning Large Language Models
The impact of random seeds in fine-tuning large language models (LLMs) has
been largely overlooked despite its potential influence on model performance. In
this study, we systematically evaluate the effects of random seeds on LLMs
using the GLUE and SuperGLUE benchmarks. We analyze the macro-level impact
through traditional metrics like accuracy and F1, calculating their mean and
variance to quantify performance fluctuations. To capture the micro-level
effects, we introduce a novel metric, consistency, measuring the stability of
individual predictions across runs. Our experiments reveal significant variance
at both macro and micro levels, underscoring the need for careful consideration
of random seeds in fine-tuning and evaluation.
comment: 7 pages, 5 tables, 3 figures. Accepted at IJCNLP 2025. This is the
final, peer-reviewed version of the work, which supersedes and extends the
unauthorized draft previously posted as arXiv:2503.07329
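The macro- and micro-level quantities can be illustrated with a few lines of
code. The consistency definition below (agreement with the modal prediction per
example, averaged over examples) is one plausible reading of the metric, not
necessarily the paper's exact formula.
    # Macro stats across seeds and a micro-level consistency score (illustrative).
    from collections import Counter
    from statistics import mean, pstdev

    def macro_stats(accuracies_per_seed):
        return mean(accuracies_per_seed), pstdev(accuracies_per_seed)

    def micro_consistency(predictions_per_seed):
        """predictions_per_seed: one equal-length list of predictions per random seed."""
        per_example = zip(*predictions_per_seed)
        scores = [Counter(preds).most_common(1)[0][1] / len(preds) for preds in per_example]
        return mean(scores)

    runs = [["pos", "neg", "pos"], ["pos", "pos", "pos"], ["pos", "neg", "neg"]]
    print(macro_stats([0.81, 0.84, 0.79]))
    print(micro_consistency(runs))   # per-example agreement 1.0, 2/3, 2/3 -> ~0.78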
♻ ☆ Reinforcement Learning Foundations for Deep Research Systems: A Survey
Wenjun Li, Zhi Chen, Jingru Lin, Hannan Cao, Wei Han, Sheng Liang, Zhi Zhang, Kuicai Dong, Dexun Li, Chen Zhang, Yong Liu
Deep research systems, agentic AI that solve complex, multi-step tasks by
coordinating reasoning, search across the open web and user files, and tool
use, are moving toward hierarchical deployments with a Planner, Coordinator,
and Executors. In practice, training entire stacks end-to-end remains
impractical, so most work trains a single planner connected to core tools such
as search, browsing, and code. While SFT imparts protocol fidelity, it suffers
from imitation and exposure biases and underuses environment feedback.
Preference alignment methods such as DPO are schema and proxy-dependent,
off-policy, and weak for long-horizon credit assignment and multi-objective
trade-offs. A further limitation of SFT and DPO is their reliance on
human-defined decision points and subskills through schema design and labeled
comparisons. Reinforcement learning aligns with closed-loop, tool-interaction
research by optimizing trajectory-level policies, enabling exploration,
recovery behaviors, and principled credit assignment, and it reduces dependence
on such human priors and rater biases.
This survey is, to our knowledge, the first dedicated to the RL foundations
of deep research systems. It systematizes recent work along three axes: (i)
data synthesis and curation; (ii) RL methods for agentic research covering
stability, sample efficiency, long context handling, reward and credit design,
multi-objective optimization, and multimodal integration; and (iii) agentic RL
training systems and frameworks. We also cover agent architecture and
coordination, as well as evaluation and benchmarks, including recent QA, VQA,
long-form synthesis, and domain-grounded, tool-interaction tasks. We distill
recurring patterns, surface infrastructure bottlenecks, and offer practical
guidance for training robust, transparent deep research agents with RL.
comment: 39 pages, second version
♻ ☆ Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models
Large language models (LLMs) have transformed natural language processing,
but their reliable deployment requires effective uncertainty quantification
(UQ). Existing UQ methods are often heuristic and lack a probabilistic
interpretation. This paper begins by providing a theoretical justification for
the role of perturbations in UQ for LLMs. We then introduce a dual random walk
perspective, modeling input-output pairs as two Markov chains with transition
probabilities defined by semantic similarity. Building on this, we propose a
fully probabilistic framework based on an inverse model, which quantifies
uncertainty by evaluating the diversity of the input space conditioned on a
given output through systematic perturbations. Within this framework, we define
a new uncertainty measure, Inv-Entropy. A key strength of our framework is its
flexibility: it supports various definitions of uncertainty measures,
embeddings, perturbation strategies, and similarity metrics. We also propose
GAAP, a perturbation algorithm based on genetic algorithms, which enhances the
diversity of sampled inputs. In addition, we introduce a new evaluation metric,
Temperature Sensitivity of Uncertainty (TSU), which directly assesses
uncertainty without relying on correctness as a proxy. Extensive experiments
demonstrate that Inv-Entropy outperforms existing semantic UQ methods. The code
to reproduce the results can be found at
https://github.com/UMDataScienceLab/Uncertainty-Quantification-for-LLMs.
♻ ☆ Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models IJCAI2025
Hao Cheng, Erjia Xiao, Yichi Wang, Lingfeng Zhang, Qiang Zhang, Jiahang Cao, Kaidi Xu, Mengshu Sun, Xiaoshuai Hao, Jindong Gu, Renjing Xu
Current Cross-Modality Generation Models (GMs) demonstrate remarkable
capabilities in various generative tasks. Given the ubiquity and information
richness of vision modality inputs in real-world scenarios, Cross-Vision tasks,
encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have
attracted significant attention. Large Vision Language Models (LVLMs) and I2I
Generation Models (GMs) are employed to handle VLP and I2I tasks, respectively.
Previous research indicates that printing typographic words into input images
significantly induces LVLMs and I2I GMs to produce disruptive outputs that are
semantically aligned with those words. Additionally, visual prompts, as a more
sophisticated form of typography, are also revealed to pose security risks to
various applications of cross-vision tasks. However, the specific
characteristics of the threats posed by visual prompts remain underexplored. In
this paper, to comprehensively investigate the performance impact induced by
Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs, we
propose the Typographic Visual Prompts Injection Dataset and thoroughly
evaluate the TVPI security risks on various open-source and closed-source LVLMs
and I2I GMs under visual prompts with different target semantics, deepening the
understanding of TVPI threats.
comment: This paper is accepted by IJCAI2025 Workshop on Deepfake Detection,
Localization, and Interpretability as Best Student Paper
♻ ☆ Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction
Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and
parallel decoding but suffer from prohibitive quadratic computational
complexity and memory overhead during inference. Current caching techniques
accelerate decoding by storing full-layer states, yet impose substantial memory
usage that limits long-context applications. Our analysis of attention patterns
in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining
salient across decoding steps and low-relevance tokens staying unimportant,
motivating selective cache eviction. We propose Sparse-dLLM, the first
training-free framework integrating dynamic cache eviction with sparse
attention via delayed bidirectional sparse caching. By leveraging the stability
of token saliency over steps, it retains critical tokens and dynamically evicts
unimportant prefix/suffix entries using an attention-guided strategy. Extensive
experiments on LLaDA and Dream series demonstrate Sparse-dLLM achieves up to
10$\times$ higher throughput than vanilla dLLMs, with comparable performance
and similar peak memory costs, outperforming previous methods in efficiency and
effectiveness. The code is available at
https://github.com/OpenMOSS/Sparse-dLLM.
comment: 12 pages, 7 figures
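The eviction step can be sketched independently of the dLLM itself: score each
cached token by how much attention it receives and keep only the most salient
ones. The mean-attention saliency and fixed budget below are assumptions for
illustration, not Sparse-dLLM's delayed bidirectional caching strategy.
    # Attention-guided cache eviction sketch (illustrative).
    import numpy as np

    def evict_cache(attention_weights, keep_budget):
        """attention_weights: (num_queries, num_cached_tokens); returns indices to keep."""
        saliency = attention_weights.mean(axis=0)        # average attention each cached token receives
        keep = np.argsort(saliency)[::-1][:keep_budget]  # retain the most-attended tokens
        return np.sort(keep)

    attn = np.random.default_rng(0).random((8, 32))
    print(evict_cache(attn, keep_budget=16))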
♻ ☆ From Haystack to Needle: Label Space Reduction for Zero-shot Classification
We present Label Space Reduction (LSR), a novel method for improving
zero-shot classification performance of Large Language Models (LLMs). LSR
iteratively refines the classification label space by systematically ranking
and reducing candidate classes, enabling the model to concentrate on the most
relevant options. By leveraging unlabeled data with the statistical learning
capabilities of data-driven models, LSR dynamically optimizes the label space
representation at test time. Our experiments across seven benchmarks
demonstrate that LSR improves macro-F1 scores by an average of 7.0% (up to
14.2%) with Llama-3.1-70B and 3.3% (up to 11.1%) with Claude-3.5-Sonnet
compared to standard zero-shot classification baselines. To reduce the
computational overhead of LSR, which requires an additional LLM call at each
iteration, we propose distilling the model into a probabilistic classifier,
allowing for efficient inference.
comment: Add acknowledgment
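The iterative reduction loop is simple to sketch: score the remaining labels,
drop the weakest, and repeat until few candidates remain. The scoring callable
stands in for an LLM (or distilled classifier) call, and the keep ratio and
stopping rule are illustrative choices.
    # Illustrative label-space reduction loop.
    def label_space_reduction(text, labels, score, keep_ratio=0.5, min_labels=2):
        """score(text, label) -> float, higher meaning more relevant."""
        candidates = list(labels)
        while len(candidates) > min_labels:
            ranked = sorted(candidates, key=lambda lab: score(text, lab), reverse=True)
            candidates = ranked[:max(min_labels, int(len(ranked) * keep_ratio))]
        return max(candidates, key=lambda lab: score(text, lab))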
♻ ☆ Traversal Verification for Speculative Tree Decoding NeurIPS 2025
Speculative decoding is a promising approach for accelerating large language
models. The primary idea is to use a lightweight draft model to speculate the
output of the target model for multiple subsequent timesteps, and then verify
them in parallel to determine whether the drafted tokens should be accepted or
rejected. To enhance acceptance rates, existing frameworks typically construct
token trees containing multiple candidates in each timestep. However, their
reliance on token-level verification mechanisms introduces two critical
limitations: First, the probability distribution of a sequence differs from
that of individual tokens, leading to suboptimal acceptance length. Second,
current verification schemes begin from the root node and proceed layer by
layer in a top-down manner. Once a parent node is rejected, all its child nodes
should be discarded, resulting in inefficient utilization of speculative
candidates. This paper introduces Traversal Verification, a novel speculative
decoding algorithm that fundamentally rethinks the verification paradigm
through leaf-to-root traversal. Our approach considers the acceptance of the
entire token sequence from the current node to the root, and preserves
potentially valid subsequences that would be prematurely discarded by existing
methods. We theoretically prove that the probability distribution obtained
through Traversal Verification is identical to that of the target model,
guaranteeing lossless inference while achieving substantial acceleration gains.
Experimental results across different large language models and multiple tasks
show that our method consistently improves acceptance length and throughput
over existing methods.
comment: NeurIPS 2025 poster
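A stripped-down view of the leaf-to-root idea: rather than accepting tokens
top-down one by one, test progressively shorter prefixes of a drafted sequence
until one passes a sequence-level acceptance test. The acceptance rule below is
a standard sequence-level rejection sampling check and only approximates the
full tree algorithm in the paper.
    # Simplified leaf-to-root verification over a single drafted branch.
    import random

    def accept_sequence(draft_probs, target_probs):
        """Accept the whole prefix with probability min(1, prod(p_target / p_draft))."""
        ratio = 1.0
        for p_draft, p_target in zip(draft_probs, target_probs):
            ratio *= p_target / p_draft
        return random.random() < min(1.0, ratio)

    def traversal_verify(drafted_tokens, draft_probs, target_probs):
        """Return the longest accepted prefix, testing from the full sequence downward."""
        for end in range(len(drafted_tokens), 0, -1):
            if accept_sequence(draft_probs[:end], target_probs[:end]):
                return drafted_tokens[:end]
        return []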
♻ ☆ HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models
Stephan Oepen, Nikolay Arefev, Mikko Aulamo, Marta Bañón, Maja Buljan, Laurie Burchell, Lucas Charpentier, Pinzhen Chen, Mariya Fedorova, Ona de Gibert, Barry Haddow, Jan Hajič, Jindřich Helcl, Andrey Kutuzov, Veronika Laippala, Zihao Li, Risto Luukkonen, Bhavitvya Malik, Vladislav Mikhailov, Amanda Myntti, Dayyán O'Brien, Lucie Poláková, Sampo Pyysalo, Gema Ramírez Sánchez, Janine Siewert, Pavel Stepachev, Jörg Tiedemann, Teemu Vahtola, Dušan Variš, Fedor Vitiugin, Tea Vojtěchová, Jaume Zaragoza
We present an ongoing initiative to provide open, very large, high-quality,
and richly annotated textual datasets for almost 200 languages. At 30 trillion
tokens, this is likely the largest generally available multilingual collection
of LLM pre-training data. These datasets are derived from web crawls from
different sources and accompanied with a complete, open-source pipeline for
document selection from web archives, text extraction from HTML, language
identification for noisy texts, exact and near-deduplication, annotation with,
among others, register labels, text quality estimates, and personally
identifiable information; and final selection and filtering. We report on data
quality probes through contrastive and analytical statistics, through manual
inspection of samples for 24 languages, and through end-to-end evaluation of
various language model architectures trained on this data. For multilingual LLM
evaluation, we provide a comprehensive collection of benchmarks for nine
European languages, with special emphasis on natively created tasks, mechanisms
to mitigate prompt sensitivity, and refined normalization and aggregation of
scores. Additionally, we train and evaluate a family of 57 monolingual
encoder-decoder models, as well as a handful of monolingual GPT-like reference
models. Besides the monolingual data and models, we also present a very large
collection of parallel texts automatically mined from this data, together with
a novel parallel corpus synthesized via machine translation.
♻ ☆ Constraint-Driven Small Language Models Based on Agent and OpenAlex Knowledge Graph: Mining Conceptual Pathways and Discovering Innovation Points in Academic Papers
In recent years, the rapid increase in academic publications across various
fields has posed severe challenges for academic paper analysis: scientists
struggle to timely and comprehensively track the latest research findings and
methodologies. Key concept extraction has proven to be an effective analytical
paradigm, and its automation has been achieved with the widespread application
of language models in industrial and scientific domains. However, existing
paper databases are mostly limited to similarity matching and basic
classification of key concepts, failing to deeply explore the relational
networks between concepts. This paper builds on the OpenAlex open-source
knowledge graph. By analyzing nearly 8,000 openly available paper records from
Novosibirsk State University, we discovered a strong correlation between the
distribution patterns of paper key concept paths and both innovation points and
rare paths. We propose a prompt engineering-based key concept path analysis
method. This method leverages small language models to achieve precise key
concept extraction and innovation point identification, and constructs an agent
based on a knowledge graph constraint mechanism to enhance analysis accuracy.
Through fine-tuning of the Qwen and DeepSeek models, we achieved significant
improvements in accuracy, with the models publicly available on the Hugging
Face platform.
comment: 11 pages, 10 figures
♻ ☆ Distilling LLM Agent into Small Models with Retrieval and Code Tools NeurIPS 2025
Large language models (LLMs) excel at complex reasoning tasks but remain
computationally expensive, limiting their practical deployment. To address
this, recent works have focused on distilling reasoning capabilities into
smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher
LLMs. However, this approach struggles in scenarios requiring rare factual
knowledge or precise computation, where sLMs often hallucinate due to limited
capability. In this work, we propose Agent Distillation, a framework for
transferring not only reasoning capability but full task-solving behavior from
LLM-based agents into sLMs with retrieval and code tools. We improve agent
distillation along two complementary axes: (1) we introduce a prompting method
called first-thought prefix to enhance the quality of teacher-generated
trajectories; and (2) we propose a self-consistent action generation for
improving test-time robustness of small agents. We evaluate our method on eight
reasoning tasks across factual and mathematical domains, covering both
in-domain and out-of-domain generalization. Our results show that sLMs as small
as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier
larger 1.5B, 3B, 7B models fine-tuned using CoT distillation, demonstrating the
potential of agent distillation for building practical, tool-using small
agents. Our code is available at https://github.com/Nardien/agent-distillation.
comment: NeurIPS 2025 Spotlight
♻ ☆ A Survey on Collaborating Small and Large Language Models for Performance, Cost-effectiveness, Cloud-edge Privacy, and Trustworthiness
Large language models (LLMs) have achieved remarkable progress across domains
and applications but face challenges such as high fine-tuning costs, inference
latency, limited edge deployability, and reliability concerns. Small language
models (SLMs), with compact, efficient, and adaptable features, offer promising
solutions. Building on this potential, recent research explores collaborative
frameworks that integrate their complementary strengths, leveraging SLMs'
specialization and efficiency with LLMs' generalization and reasoning to
address diverse objectives across tasks and deployment scenarios. Motivated by
these developments, this paper presents a systematic survey of SLM-LLM
collaboration from the perspective of collaboration objectives. We propose a
taxonomy covering four goals: performance enhancement, cost-effectiveness,
cloud-edge privacy, and trustworthiness. Under this framework, we review
representative methods, summarize design paradigms, and outline open challenges
and future directions toward efficient and secure SLM-LLM collaboration. The
collected papers are available at https://github.com/FairyFali/SLMs-Survey.
comment: 24 pages, 19 figures-under review; more detailed than v1
♻ ☆ REFA: Reference Free Alignment for multi-preference optimization
To mitigate reward hacking from response verbosity, modern preference
optimization methods are increasingly adopting length normalization (e.g.,
SimPO, ORPO, LN-DPO). While effective against this bias, we demonstrate that
length normalization itself introduces a failure mode, the URSLA shortcut, whereby
models learn to satisfy the alignment objective by prematurely truncating
low-quality responses rather than learning from their semantic content. To
address this, we introduce REFA, a new alignment framework that imposes
probabilistic control on the structural token that governs termination. Our core
innovation is a new class of regularizers that operate directly on the
probability of the End-of-Sequence (EOS) token, a previously unexploited
control lever. This token-level intervention provides a principled solution to
the URSLA shortcut, ensuring genuine quality improvements. Furthermore, it
unlocks a versatile mechanism for managing the alignment-efficiency tradeoff,
enabling practitioners to fine-tune models that adhere to specific token
budgets. Empirically, REFA achieves a 60.29% win rate and a 52.17%
length-controlled win rate on AlpacaEval2 with Llama-3-8B-Instruct,
demonstrating the power of our token-level control paradigm.
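To make the token-level control lever concrete, the sketch below shows one
hypothetical regularizer on the End-of-Sequence probability that discourages
termination before a target token budget. The functional form is an assumption
for illustration; REFA's actual regularizers are defined in the paper.
    # Hypothetical EOS-probability regularizer discouraging premature truncation.
    import math

    def eos_regularizer(eos_probs, target_length, strength=1.0):
        """Penalize probability mass on EOS at positions earlier than the budget."""
        penalty = 0.0
        for t, p in enumerate(eos_probs):
            if t < target_length:
                penalty += -math.log(max(1.0 - p, 1e-8))  # high early EOS prob -> large penalty
        return strength * penalty / max(target_length, 1)

    print(eos_regularizer([0.01, 0.02, 0.60, 0.90], target_length=3))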
♻ ☆ The Mirror Loop: Recursive Non-Convergence in Generative Reasoning Systems
Large language models are often described as capable of reflective reasoning,
yet recursive self-evaluation without external feedback frequently yields
reformulation rather than progress. We test this prediction in a cross-provider
study of 144 reasoning sequences across three models (OpenAI GPT-4o-mini,
Anthropic Claude 3 Haiku, and Google Gemini 2.0 Flash) and four task families
(arithmetic, code, explanation, reflection), each iterated ten times under two
conditions: ungrounded self-critique and a minimal grounding intervention (a
single verification step at iteration three). Mean informational change (delta
I, measured via normalized edit distance) declined by 55% from early (0.193) to
late (0.087) iterations in ungrounded runs, with consistent patterns across all
three providers. Grounded runs showed a +28% rebound in informational change
immediately after the intervention and sustained non-zero variance thereafter.
Complementary measures-n-gram novelty, embedding drift, and character-level
entropy-converged on the same pattern: reflection without contact tends toward
informational closure. We interpret this as evidence for a structural limit on
self-correction in generative reasoning: without an exchange of information
with an independent verifier or environment, recursive inference approaches an
attractor state of epistemic stasis. Minimal grounding functions as dissipative
coupling, reintroducing informational flux. The cross-architecture consistency
suggests the mirror loop arises from shared autoregressive training objectives
rather than provider-specific alignment schemes. The results delineate when
reflection is performative rather than epistemic and motivate design principles
for grounded, cooperative reasoning. Materials and code are publicly available.
comment: 18 pages, 2 figures. Category: cs.LG. Code and data:
https://github.com/Course-Correct-Labs/mirror-loop
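The informational-change measure is easy to reproduce in outline: delta I as the
normalized edit distance between successive iterations of a run. The exact
normalization used by the authors may differ; dividing by the longer string is
one common choice.
    # Normalized edit distance between consecutive iterations (illustrative delta I).
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def delta_i(iterations):
        """Normalized edit distance between consecutive texts in a recursive run."""
        return [levenshtein(x, y) / max(len(x), len(y), 1)
                for x, y in zip(iterations, iterations[1:])]

    print(delta_i(["step one reasoning",
                   "step one reasoning, restated",
                   "step one reasoning, restated"]))   # change shrinks toward zero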
♻ ☆ MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning
Recent progress in Multi-modal Large Language Models (MLLMs) has enabled
step-by-step multi-modal mathematical reasoning by performing visual operations
based on the textual instructions. A promising approach uses code as an
intermediate representation to precisely express and manipulate the images in
the reasoning steps. However, existing evaluations focus mainly on text-only
reasoning outputs, leaving the MLLM's ability to perform accurate visual
operations via code largely unexplored. This work takes a first step toward
addressing that gap by evaluating MLLM's code-based capabilities in multi-modal
mathematical reasoning. Specifically, our framework focuses on two key
evaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model's
ability to accurately understand and construct visualizations from scratch. (2)
Multi-modal Code Editing (MCE) assesses the model's capacity for fine-grained
operations, which include three types: Deletion, Modification and Annotation.
To evaluate the above tasks, we incorporate a dataset that covers the five most
popular types of mathematical figures, including geometric diagrams, function
plots, and three types of statistical charts, to provide a comprehensive and
effective measurement of existing MLLMs. Our experimental evaluation involves
nine mainstream MLLMs, and the results reveal that existing models still lag
significantly behind human performance in performing fine-grained visual
operations.
comment: Under Review
♻ ☆ LexTime: A Benchmark for Temporal Ordering of Legal Events EMNLP 2025
Understanding temporal relationships and accurately reconstructing the event
timeline is important for case law analysis, compliance monitoring, and legal
summarization. However, existing benchmarks lack specialized language
evaluation, leaving a gap in understanding how LLMs handle event ordering in
legal contexts. We introduce LexTime, a dataset designed to evaluate LLMs'
event ordering capabilities in legal language, consisting of 512 instances from
U.S. Federal Complaints with annotated event pairs and their temporal
relations. Our findings show that (1) LLMs are more accurate on legal event
ordering than on narrative texts (up to +10.5%); (2) longer input contexts and
implicit events boost accuracy, reaching 80.8% for implicit-explicit event
pairs; (3) legal linguistic complexities and nested clauses remain a challenge.
While performance is promising, specific features of legal texts remain a
bottleneck for legal temporal event reasoning, and we propose concrete modeling
directions to better address them.
comment: EMNLP 2025 (Findings) long paper
♻ ☆ Training Optimal Large Diffusion Language Models
We introduce Quokka, the first systematic scaling law for diffusion language
models (DLMs), encompassing both compute-constrained and data-constrained
regimes, and studying the key modeling and optimization designs. Quokka is a
close companion to Chinchilla and covers a broader scope. We hope the results
provide short-term practical guidance for DLM training and long-term
inspiration for the whole AI community.
♻ ☆ Unifying Symbolic Music Arrangement: Track-Aware Reconstruction and Structured Tokenization NeurIPS 2025
We present a unified framework for automatic multitrack music arrangement
that enables a single pre-trained symbolic music model to handle diverse
arrangement scenarios, including reinterpretation, simplification, and additive
generation. At its core is a segment-level reconstruction objective operating
on token-level disentangled content and style, allowing for flexible any-to-any
instrumentation transformations at inference time. To support track-wise
modeling, we introduce REMI-z, a structured tokenization scheme for multitrack
symbolic music that enhances modeling efficiency and effectiveness for both
arrangement tasks and unconditional generation. Our method outperforms
task-specific state-of-the-art models on representative tasks in different
arrangement scenarios -- band arrangement, piano reduction, and drum
arrangement, in both objective metrics and perceptual evaluations. Taken
together, our framework demonstrates strong generality and suggests broader
applicability in symbolic music-to-music transformation.
comment: NeurIPS 2025 camera ready version
♻ ☆ AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs
Weight decay is a standard regularization technique for training large
language models (LLMs). While it is common to assign a uniform decay rate to
every layer, this approach overlooks the structural diversity of LLMs and the
varying spectral properties across modules. In this paper, we introduce
AlphaDecay, a simple yet effective method that adaptively assigns different
weight decay strengths to each module of an LLM. Our approach is guided by
Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical
spectral density (ESD) of weight correlation matrices to quantify
"heavy-tailedness." Modules exhibiting more pronounced heavy-tailed ESDs,
reflecting stronger feature learning, are assigned weaker decay, while modules
with lighter-tailed spectra receive stronger decay. Our method leverages
tailored weight decay assignments to balance the module-wise differences in
spectral properties, leading to improved performance. Extensive pre-training
tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay
achieves better perplexity and generalization than conventional uniform decay
and other adaptive decay baselines. Our code is available at
https://github.com/hed-ucas/AlphaDecay.
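The module-wise assignment can be sketched as follows: estimate a tail index
from each module's eigenspectrum and give heavier-tailed modules weaker decay.
The crude Hill-style estimator and the linear mapping below are illustrative
stand-ins for the HT-SR-based procedure used by AlphaDecay.
    # Illustrative module-wise decay assignment from spectral heavy-tailedness.
    import numpy as np

    def tail_index(weight, k_frac=0.1):
        """Crude Hill-style estimate over the top eigenvalues of the correlation matrix."""
        w = np.asarray(weight)
        eigs = np.sort(np.linalg.eigvalsh(w @ w.T / w.shape[1]))[::-1]
        k = max(2, int(len(eigs) * k_frac))
        top = eigs[:k]
        return k / np.sum(np.log(top / top[-1] + 1e-12))

    def module_wise_decay(named_weights, base_decay=0.1):
        """Smaller tail index (heavier tail) -> weaker decay, and vice versa."""
        alphas = {name: tail_index(w) for name, w in named_weights.items()}
        mean_alpha = np.mean(list(alphas.values()))
        return {name: base_decay * alpha / mean_alpha for name, alpha in alphas.items()}

    rng = np.random.default_rng(0)
    modules = {"attn.q": rng.normal(size=(64, 64)),
               "mlp.up": rng.standard_t(df=3, size=(64, 64))}
    print(module_wise_decay(modules))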
♻ ☆ PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems AACL 2025
Oshayer Siddique, J. M Areeb Uzair Alam, Md Jobayer Rahman Rafy, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan
The discipline of physics stands as a cornerstone of human intellect, driving
the evolution of technology and deepening our understanding of the fundamental
principles of the cosmos. Contemporary literature includes some works centered
on the task of solving physics problems - a crucial domain of natural language
reasoning. In this paper, we evaluate the performance of frontier LLMs in
solving physics problems, both mathematical and descriptive. We also employ a
plethora of inference-time techniques and agentic frameworks to improve the
performance of the models. This includes the verification of proposed solutions
in a cumulative fashion by other, smaller LLM agents, and we perform a
comparative analysis of the performance that the techniques entail. There are
significant improvements when the multi-agent framework is applied to problems
that the models initially perform poorly on. Furthermore, we introduce a new
evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609
problems sourced from various physics textbooks
and their corresponding correct solutions scraped from physics forums and
educational websites. Our code and data are publicly available at
https://github.com/areebuzair/PhysicsEval.
comment: Accepted in Findings of the Association for Computational
Linguistics: IJCNLP-AACL 2025, 23 pages, 4 figures, 8 tables
♻ ☆ VoiceAgentBench: Are Voice Assistants ready for agentic tasks?
Large-scale Speech Language Models (SpeechLMs) have enabled voice assistants
capable of understanding natural spoken queries and performing complex tasks.
However, existing speech benchmarks primarily focus on isolated capabilities
such as transcription or question-answering, and do not systematically
evaluate agentic scenarios encompassing multilingual and cultural
understanding, as well as adversarial robustness. To address this, we introduce
VoiceAgentBench, a comprehensive benchmark designed to evaluate SpeechLMs in
realistic spoken agentic settings. It comprises over 5,500 synthetic spoken
queries, including dialogues grounded in the Indian context, covering single-tool
invocations, multi-tool workflows, multi-turn interactions, and safety
evaluations. The benchmark supports English, Hindi, and 5 other Indian
languages, reflecting real-world linguistic and cultural diversity. We simulate
speaker variability using a novel sampling algorithm that selects audios for
TTS voice conversion based on their speaker embeddings, maximizing acoustic and
speaker diversity. Our evaluation measures tool selection accuracy, structural
consistency, and the correctness of tool invocations, including adversarial
robustness. Our experiments reveal significant gaps in contextual tool
orchestration tasks, Indic generalization, and adversarial robustness, exposing
critical limitations of current SpeechLMs.
♻ ☆ Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
Multimodal large language models (MLLMs) have shown strong capabilities but
remain limited to fixed modality pairs and require costly fine-tuning with
large aligned datasets. Building fully omni-capable models that can integrate
text, images, audio, and video remains impractical and lacks robust reasoning
support. In this paper, we propose an Agent-Omni framework that coordinates
existing foundation models through a master-agent system, enabling flexible
multimodal reasoning without retraining. The master agent interprets user
intent, delegates subtasks to modality-specific agents, and integrates their
outputs into coherent responses. Extensive experiments across text, image,
audio, video, and omni benchmarks show that Agent-Omni consistently achieves
state-of-the-art performance, particularly on tasks requiring complex
cross-modal reasoning. Its agent-based design enables seamless integration of
specialized foundation models, ensuring adaptability to diverse inputs while
maintaining transparency and interpretability. In addition, the framework is
modular and easily extensible, allowing future improvements as stronger models
become available.
comment: 16 pages, 7 figures, 14 tables. Under Review
♻ ☆ Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Modern vision-language models (VLMs) excel at many multimodal tasks, yet
their grasp of temporal information in video remains weak and, crucially,
under-evaluated. We probe this gap with a deceptively simple but revealing
challenge: judging the arrow of time (AoT)-whether a short clip is played
forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated
benchmark that tests whether VLMs can infer temporal direction in natural
videos using the same stimuli and behavioral baselines established for humans.
Our comprehensive evaluation of open-weight and proprietary, reasoning and
non-reasoning VLMs reveals that most models perform near chance, and even the
best lag far behind human accuracy on physically irreversible processes (e.g.,
free fall, diffusion/explosion) and causal manual actions (division/addition)
that humans recognize almost instantly. These results highlight a fundamental
gap in current multimodal systems: while they capture rich visual-semantic
correlations, they lack the inductive biases required for temporal continuity
and causal understanding. We release the code and data for AoT-PsyPhyBENCH to
encourage further progress in the physical and temporal reasoning capabilities
of VLMs.
comment: 10 pages
♻ ☆ AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation
The creation of high-quality datasets to improve Large Language Model (LLM)
reasoning remains a significant challenge, as current methods often suffer from
generating low-quality/incorrect answers and limited information richness from
available data sources. To address this, we propose AgenticMath, a novel
agentic pipeline for generating high-quality mathematical question-answer pairs
to enhance the supervised fine-tuning of LLMs. Our method operates through four
stages: (1) Seed Question Filter that selects questions with high information
richness, complexity, and clarity; (2) an Agentic Question Rephrase step that
employs a multi-agent system to generate diverse, logically consistent
paraphrases; (3) an Answer Augment step that rewrites answers using
chain-of-thought reasoning to enhance numerical and logical correctness,
without reliance on human-provided labels; and (4) a final Question and Answer
Evaluation that retains only the highest-quality pairs. Extensive experiments
demonstrate that fine-tuning 3B-8B parameter LLMs on AgenticMath-generated
datasets (comprising only 30-60K math samples) achieves competitive or superior
performance on diverse in-domain and out-of-domain mathematical reasoning
benchmarks compared to baselines trained on much more data (e.g., 400K or 2.3M
samples). Our work demonstrates that targeted, high-quality data generation is
a more efficient path to improving mathematical reasoning in LLMs than
large-scale, low-quality alternatives.
comment: 9 pages
♻ ☆ SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents
Large language model (LLM) based search agents iteratively generate queries,
retrieve external information, and reason to answer open-domain questions.
While researchers have primarily focused on improving their utility, their
safety behaviors remain underexplored. In this paper, we first evaluate search
agents using red-teaming datasets and find that they are more likely to produce
harmful outputs than base LLMs. For example, when asked "How can I track
someone's location without their consent?", a base model refuses, whereas a
search agent designed to retrieve and cite sources may lower its refusal
threshold, fetch documents (e.g., court cases), and, once appended, synthesize
them into an informative yet unsafe summary. We further show that
utility-oriented fine-tuning intensifies this risk, motivating joint alignment
of safety and utility. We present SafeSearch, a multi-objective reinforcement
learning approach that couples a final-output safety/utility reward with a
novel query-level shaping term that penalizes unsafe queries and rewards safe
ones. Experiments show that SafeSearch reduces agent harmfulness by over 70%
across three red-teaming datasets while producing safe, helpful responses, and
matches the QA performance of a utility-only finetuned agent; further analyses
confirm the effectiveness of the query-level reward in jointly improving safety
and utility.
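The combined objective can be pictured as a final-output safety/utility reward plus a per-query shaping term; the sketch below is an assumed formulation with hypothetical scorer callables and weights, not the paper's exact reward.

```python
# Assumed sketch of a multi-objective reward with query-level shaping
# (all scorer functions and weights are hypothetical placeholders).
from typing import Callable, List

def trajectory_reward(
    final_answer: str,
    queries: List[str],
    safety_score: Callable[[str], float],   # 1.0 = safe, 0.0 = unsafe
    utility_score: Callable[[str], float],  # e.g., answer correctness
    query_is_safe: Callable[[str], bool],
    alpha: float = 1.0,   # weight on final-output safety
    beta: float = 1.0,    # weight on final-output utility
    gamma: float = 0.1,   # weight on the query-level shaping term
) -> float:
    # Final-output term: trade off safety and utility of the answer itself.
    final_term = alpha * safety_score(final_answer) + beta * utility_score(final_answer)
    # Query-level shaping: reward safe queries, penalize unsafe ones.
    shaping = sum(1.0 if query_is_safe(q) else -1.0 for q in queries)
    return final_term + gamma * shaping
```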
comment: Code available at https://github.com/amazon-science/SafeSearch
♻ ☆ Verdict: A Library for Scaling Judge-Time Compute
The use of LLMs as automated judges ("LLM-as-a-judge") is now widespread, yet
standard judges suffer from a multitude of reliability issues. To address these
challenges, we introduce Verdict, an open-source library for scaling judge-time
compute to enhance the accuracy, reliability, and interpretability of automated
evaluators. Verdict leverages the composition of modular reasoning units (such
as verification, debate, and aggregation) and increased inference-time compute
to improve LLM judge quality. Across a variety of challenging tasks such as
content moderation, fact-checking, and hallucination detection, Verdict judges
achieve performance competitive with orders-of-magnitude larger fine-tuned
judges, prompted judges, and reasoning models. Our framework establishes a
foundation for scalable, interpretable, and reliable LLM-based evaluation
systems for both researchers and practitioners.
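A generic illustration of composing verification, debate, and aggregation units into a judge pipeline; this is not the Verdict library's API, only an assumed sketch of the compositional idea with placeholder judge callables.

```python
# Generic sketch of composing modular judge units (debate, aggregate, verify).
# Not the Verdict API; unit implementations are assumed placeholders.
from collections import Counter
from typing import Callable, List

Judge = Callable[[str], str]  # maps a sample to a verdict label

def debate(judges: List[Judge], rounds: int = 1) -> Judge:
    """Run several judges (optionally over multiple rounds) and aggregate by majority vote."""
    def unit(sample: str) -> str:
        votes = [j(sample) for _ in range(rounds) for j in judges]
        return Counter(votes).most_common(1)[0][0]
    return unit

def verify(judge: Judge, checker: Callable[[str, str], bool]) -> Judge:
    """Wrap a judge with a verification step; fall back to 'uncertain' when the check fails."""
    def unit(sample: str) -> str:
        verdict = judge(sample)
        return verdict if checker(sample, verdict) else "uncertain"
    return unit

# Example composition: debate among three prompted judges, then verify the outcome.
# pipeline = verify(debate([judge_a, judge_b, judge_c]), checker=consistency_check)
```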
♻ ☆ FaStfact: Faster, Stronger Long-Form Factuality Evaluations in LLMs EMNLP 2025
Yingjia Wan, Haochen Tan, Xiao Zhu, Xinyu Zhou, Zhiwei Li, Qingsong Lv, Changxuan Sun, Jiaqi Zeng, Yi Xu, Jianqiao Lu, Yinhong Liu, Zhijiang Guo
Evaluating the factuality of long-form generations from Large Language Models
(LLMs) remains challenging due to efficiency bottlenecks and reliability
concerns. Prior efforts attempt this by decomposing text into claims, searching
for evidence, and verifying claims, but suffer from critical drawbacks: (1)
inefficiency due to overcomplicated pipeline components, and (2)
ineffectiveness stemming from inaccurate claim sets and insufficient evidence.
To address these limitations, we propose \textbf{FaStfact}, an evaluation
framework that achieves the highest alignment with human evaluation and
time/token efficiency among existing baselines. FaStfact first employs
chunk-level claim extraction integrated with confidence-based pre-verification,
significantly reducing the time and token cost while ensuring reliability. For
searching and verification, it collects document-level evidence from crawled
web-pages and selectively retrieves it during verification. Extensive
experiments on an annotated benchmark, \textbf{FaStfact-Bench}, demonstrate
that FaStfact evaluates long-form factuality both efficiently and effectively.
Code, benchmark data, and an annotation interface tool are
available at https://github.com/Yingjia-Wan/FaStfact.
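One way to picture chunk-level claim extraction with confidence-based pre-verification is as a gate that skips web search for claims the extractor is already confident about; the sketch below is an assumed simplification with placeholder callables, not the released pipeline.

```python
# Assumed sketch: chunk-level claim extraction with confidence-based
# pre-verification (placeholder callables, not the FaStfact implementation).
from typing import Callable, Dict, List, Tuple

def evaluate_factuality(
    text: str,
    chunk: Callable[[str], List[str]],
    extract_claims: Callable[[str], List[Tuple[str, float]]],  # (claim, confidence)
    search_evidence: Callable[[str], str],     # document-level evidence per claim
    verify: Callable[[str, str], bool],
    confidence_threshold: float = 0.9,
) -> Dict[str, float]:
    supported, total = 0, 0
    for piece in chunk(text):
        for claim, confidence in extract_claims(piece):
            total += 1
            if confidence >= confidence_threshold:
                supported += 1          # pre-verified: accept without search (assumption)
                continue
            evidence = search_evidence(claim)
            supported += int(verify(claim, evidence))
    return {"factual_precision": supported / max(total, 1)}
```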
comment: EMNLP 2025 (Findings)
♻ ☆ Retrieval-Augmented Feature Generation for Domain-Specific Classification ICDM 2025
Xinhao Zhang, Jinghan Zhang, Fengran Mo, Dakshak Keerthi Chandra, Yuzhong Chen, Fei Xie, Kunpeng Liu
Feature generation can significantly enhance learning outcomes, particularly
for tasks with limited data. An effective way to improve feature generation is
to expand the current feature space using existing features and to enrich its
informational content. However, generating new, interpretable features usually
requires domain-specific knowledge on top of the existing features. In this
paper, we introduce a Retrieval-Augmented Feature Generation method, RAFG, to
generate useful and explainable features for domain-specific classification
tasks. To increase the interpretability of the generated features, we conduct
knowledge retrieval among the existing features in the domain to identify
potential feature associations. These associations are expected to help
generate useful features. Moreover, we develop a framework based on large
language models (LLMs) for feature generation with reasoning to verify the
quality of the features during their generation process. Experiments across
several datasets in medical, economic, and geographic domains show that our
RAFG method can produce high-quality, meaningful features and significantly
improve classification performance compared with baseline methods.
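A rough sketch of retrieval-augmented feature generation: retrieve associations among existing features, prompt an LLM to propose a new feature, and keep it only if a reasoning-based check passes. The retriever, prompt, and verifier names are hypothetical, not the RAFG implementation.

```python
# Hypothetical sketch of retrieval-augmented feature generation (RAFG-style).
# The retriever, LLM, and verifier are assumed placeholder callables.
from typing import Callable, List
import pandas as pd

def generate_features(
    df: pd.DataFrame,
    retrieve_associations: Callable[[List[str]], List[str]],  # domain-knowledge lookup
    propose: Callable[[str], str],   # LLM returns an expression, e.g. "bmi = weight / height**2"
    verify: Callable[[str], bool],   # reasoning-based quality check
    n_candidates: int = 5,
) -> pd.DataFrame:
    context = "\n".join(retrieve_associations(list(df.columns)))
    for _ in range(n_candidates):
        expression = propose(
            f"Known feature associations:\n{context}\n"
            f"Propose one new interpretable feature as `name = formula`."
        )
        if not verify(expression):
            continue
        name, formula = (part.strip() for part in expression.split("=", 1))
        df[name] = df.eval(formula)   # materialize the generated feature
    return df
```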
comment: Accepted by ICDM 2025
♻ ☆ CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
Developing efficient CUDA kernels is increasingly critical for AI
applications such as large-scale LLM training. However, manual kernel design is
both costly and time-consuming, motivating automatic approaches that leverage
LLMs for code generation. Existing methods for automatic kernel generation,
however, often produce low-efficiency kernels, incur high computational
overhead, and fail to generalize across settings. In this work, we propose
CudaForge, a training-free multi-agent workflow for CUDA kernel generation and
optimization. Our design mirrors the iterative workflow of human
experts, which involves steps such as developing initial kernels, testing
correctness, analyzing hardware feedback, and iteratively refining. More
specifically, CudaForge employs two LLM agents, a Coder and a Judge, that
iteratively generate, correct, and optimize CUDA kernels, while integrating
hardware feedback such as Nsight Compute (NCU) metrics. In extensive
evaluations, we show that CudaForge, by leveraging base models like OpenAI-o3,
achieves 97.6\% correctness of generated kernels and an average 1.68$\times$
speedup over PyTorch baselines, substantially surpassing state-of-the-art
models including OpenAI-o3 and Kevin on KernelBench. Beyond accuracy and speed,
CudaForge demonstrates strong generalization across GPUs (A100, RTX 6000, 4090,
3090) and base models (OpenAI-o3, GPT-5, gpt-oss-120B, Claude-Sonnet-4,
QwQ-32B), while maintaining high efficiency. In particular, generating an
optimized kernel takes about 26.5 minutes on one RTX 6000 and incurs about
\$0.3 in API cost, which is significantly cheaper than existing agentic work
that costs 6 H100 hours and \$5 in API cost per kernel. Our results highlight that
multi-agent, training-free workflows can enable cost-effective, generalizable,
and high-performance CUDA kernel optimization. Code available at
https://github.com/OptimAI-Lab/CudaForge
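The Coder-Judge loop with hardware feedback can be pictured roughly as below; the agent calls, correctness test, and Nsight Compute metric hook are assumed placeholders rather than the released workflow.

```python
# Rough sketch of a Coder-Judge iteration loop with hardware feedback.
# coder / judge / compile_and_test / profile are hypothetical callables.
from typing import Callable, Dict, Optional

def optimize_kernel(
    task: str,
    coder: Callable[[str], str],              # LLM that writes or edits CUDA code
    judge: Callable[[str, Dict], str],        # LLM that reads metrics and suggests fixes
    compile_and_test: Callable[[str], bool],  # functional correctness check
    profile: Callable[[str], Dict],           # e.g., Nsight Compute (NCU) metrics
    max_rounds: int = 8,
) -> Optional[str]:
    prompt, best = task, None
    for _ in range(max_rounds):
        kernel = coder(prompt)
        if not compile_and_test(kernel):
            prompt = f"{task}\nThe previous kernel failed correctness; fix it:\n{kernel}"
            continue
        metrics = profile(kernel)             # hardware feedback
        best = kernel
        feedback = judge(kernel, metrics)     # e.g., "occupancy is low, tile shared memory"
        prompt = f"{task}\nImprove this correct kernel using the feedback:\n{feedback}\n{kernel}"
    return best
```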
♻ ☆ s3: You Don't Need That Much Data to Train a Search Agent via RL EMNLP 2025
Retrieval-augmented generation (RAG) systems empower large language models
(LLMs) to access external knowledge during inference. Recent advances have
enabled LLMs to act as search agents via reinforcement learning (RL), improving
information acquisition through multi-turn interactions with retrieval engines.
However, existing approaches either optimize retrieval using search-only
metrics (e.g., NDCG) that ignore downstream utility or fine-tune the entire LLM
to jointly reason and retrieve, entangling retrieval with generation and
limiting real search utility and compatibility with frozen or proprietary
models. In this work, we propose s3, a lightweight, model-agnostic framework
that decouples the searcher from the generator and trains the searcher using a
Gain Beyond RAG reward: the improvement in generation accuracy over naive RAG.
s3 requires only 2.4k training samples to outperform baselines trained on over
70x more data, consistently delivering stronger downstream performance across
six general QA and five medical QA benchmarks.
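The Gain Beyond RAG reward can be read as the frozen generator's accuracy when given the trained searcher's context minus its accuracy with naive RAG retrieval; the sketch below assumes simple accuracy callables and is only an illustration of that difference.

```python
# Assumed sketch of a "Gain Beyond RAG"-style reward: improvement in frozen-
# generator accuracy over naive RAG retrieval (callables are placeholders).
from typing import Callable, List

def gain_beyond_rag(
    question: str,
    gold_answer: str,
    searcher_docs: List[str],      # documents gathered by the trained searcher
    naive_rag_docs: List[str],     # top-k documents from naive retrieval
    generate: Callable[[str, List[str]], str],  # frozen generator LLM
    accuracy: Callable[[str, str], float],      # e.g., exact match or judge score
) -> float:
    answer_with_searcher = generate(question, searcher_docs)
    answer_with_naive = generate(question, naive_rag_docs)
    return accuracy(answer_with_searcher, gold_answer) - accuracy(answer_with_naive, gold_answer)
```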
comment: EMNLP 2025 camera-ready
♻ ☆ Meta-Semantics Augmented Few-Shot Relational Learning EMNLP 2025
Few-shot relational learning on knowledge graphs (KGs) aims to perform
reasoning over relations with only a few training examples. While current
methods have focused primarily on leveraging specific relational information,
rich semantics inherent in KGs have been largely overlooked. To bridge this
gap, we propose PromptMeta, a novel prompted meta-learning framework that
seamlessly integrates meta-semantics with relational information for few-shot
relational learning. PromptMeta introduces two core innovations: (1) a
Meta-Semantic Prompt (MSP) pool that learns and consolidates high-level
meta-semantics shared across tasks, enabling effective knowledge transfer and
adaptation to newly emerging relations; and (2) a learnable fusion mechanism
that dynamically combines meta-semantics with task-specific relational
information tailored to different few-shot tasks. Both components are optimized
jointly with model parameters within a meta-learning framework. Extensive
experiments and analyses on two real-world KG benchmarks validate the
effectiveness of PromptMeta in adapting to new relations with limited
supervision.
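The learnable fusion of meta-semantic prompts with task-specific relation representations could look roughly like an attention-plus-gate combination; the module below is an assumed PyTorch sketch, not the authors' code.

```python
# Assumed PyTorch sketch of gated fusion between a meta-semantic prompt pool
# and a task-specific relation embedding (not the PromptMeta implementation).
import torch
import torch.nn as nn

class MetaSemanticFusion(nn.Module):
    def __init__(self, dim: int, pool_size: int = 16):
        super().__init__()
        self.prompt_pool = nn.Parameter(torch.randn(pool_size, dim))  # shared meta-semantics
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, relation_emb: torch.Tensor) -> torch.Tensor:
        # Attend over the prompt pool with the relation embedding as the query.
        scores = relation_emb @ self.prompt_pool.t()              # (batch, pool_size)
        meta = torch.softmax(scores, dim=-1) @ self.prompt_pool   # (batch, dim)
        # A learnable gate decides how much meta-semantics to mix in.
        g = torch.sigmoid(self.gate(torch.cat([relation_emb, meta], dim=-1)))
        return g * meta + (1 - g) * relation_emb

# Example: fuse a batch of 5 relation embeddings of dimension 64.
# fused = MetaSemanticFusion(64)(torch.randn(5, 64))
```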
comment: Appears in EMNLP 2025
♻ ☆ Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Mixture-of-experts (MoE) architectures have expanded from language modeling
to automatic speech recognition (ASR). Traditional MoE methods, such as the
Switch Transformer, route experts independently within each layer. Our analysis
reveals that routers in most layers make expert choices that are not strongly
correlated with the choices of the routers in other layers. To increase the
cooperation between experts in different layers and encourage greater
specialization, we use a shared router across different MoE layers. We call
this model Omni-router Transformer. Extensive experiments on a large-scale
pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR
benchmarks demonstrate that the Omni-router Transformer achieves lower
training loss and consistently outperforms dense and Switch Transformer
models, reducing average word error rates by 11.2% and 8.2%, respectively,
while providing structured expert usage and improved robustness to diverse
data.
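The core idea of sharing one router's weights across MoE layers can be sketched as follows; the top-1 routing, expert shapes, and layer counts are illustrative assumptions, not the paper's exact architecture.

```python
# Assumed PyTorch sketch: a single router shared across several MoE layers
# (Switch-style top-1 routing; shapes are illustrative).
import torch
import torch.nn as nn

class SharedRouterMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4, num_layers: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # one router reused by every MoE layer
        self.layers = nn.ModuleList([
            nn.ModuleList([
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                for _ in range(num_experts)
            ])
            for _ in range(num_layers)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        for experts in self.layers:
            # Top-1 routing with the *shared* router weights at every layer.
            expert_idx = self.router(x).argmax(dim=-1)    # (tokens,)
            out = torch.zeros_like(x)
            for e, expert in enumerate(experts):
                mask = expert_idx == e
                if mask.any():
                    out[mask] = expert(x[mask])
            x = x + out                                   # residual around each MoE block
        return x

# Example: SharedRouterMoE(dim=64)(torch.randn(10, 64))
```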
comment: Accepted in 2025 IEEE Automatic Speech Recognition and Understanding
Workshop (ASRU)
♻ ☆ StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction
Over 70 million people worldwide experience stuttering, yet most automatic
speech systems misinterpret disfluent utterances or fail to transcribe them
accurately. Existing methods for stutter correction rely on handcrafted feature
extraction or multi-stage automatic speech recognition (ASR) and text-to-speech
(TTS) pipelines, which separate transcription from audio reconstruction and
often amplify distortions. This work introduces StutterZero and StutterFormer,
the first end-to-end waveform-to-waveform models that directly convert
stuttered speech into fluent speech while jointly predicting its transcription.
StutterZero employs a convolutional-bidirectional LSTM encoder-decoder with
attention, whereas StutterFormer integrates a dual-stream Transformer with
shared acoustic-linguistic representations. Both architectures are trained on
paired stuttered-fluent data synthesized from the SEP-28K and LibriStutter
corpora and evaluated on unseen speakers from the FluencyBank dataset. Across
all benchmarks, StutterZero achieved a 24% reduction in Word Error Rate (WER) and a
31% improvement in semantic similarity (BERTScore) compared to the leading
Whisper-Medium model. StutterFormer achieved better results, with a 28%
decrease in WER and a 34% improvement in BERTScore. The results validate the
feasibility of direct end-to-end stutter-to-fluent speech conversion, offering
new opportunities for inclusive human-computer interaction, speech therapy, and
accessibility-oriented AI systems.
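A heavily simplified sketch of a joint speech-conversion and transcription model, with one encoder feeding both a fluent-audio head and a text head; shapes and heads are illustrative assumptions, not the StutterZero or StutterFormer architecture.

```python
# Assumed PyTorch sketch: one encoder, two heads (fluent mel frames + text logits).
# Not the StutterZero/StutterFormer architecture; dimensions are illustrative.
import torch
import torch.nn as nn

class JointStutterModel(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256, vocab_size: int = 32):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(n_mels, hidden, 3, padding=1), nn.ReLU())
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.audio_head = nn.Linear(2 * hidden, n_mels)      # predicts fluent mel frames
        self.text_head = nn.Linear(2 * hidden, vocab_size)   # CTC-style character logits

    def forward(self, mel: torch.Tensor):                    # mel: (batch, n_mels, frames)
        h = self.conv(mel).transpose(1, 2)                   # (batch, frames, hidden)
        h, _ = self.encoder(h)                               # (batch, frames, 2*hidden)
        return self.audio_head(h), self.text_head(h).log_softmax(-1)

# Example: fluent_mel, char_logprobs = JointStutterModel()(torch.randn(2, 80, 200))
```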
comment: 13 pages, 5 figures