Computation and Language
☆ Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
Haokun Lin, Haobo Xu, Yichen Wu, Ziyu Guo, Renrui Zhang, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun
Recent advances in diffusion large language models (dLLMs) have introduced a
promising alternative to autoregressive (AR) LLMs for natural language
generation tasks, leveraging full attention and denoising-based decoding
strategies. However, the deployment of these models on edge devices remains
challenging due to their massive parameter scale and high resource demands.
While post-training quantization (PTQ) has emerged as a widely adopted
technique for compressing AR LLMs, its applicability to dLLMs remains largely
unexplored. In this work, we present the first systematic study on quantizing
diffusion-based language models. We begin by identifying the presence of
activation outliers, characterized by abnormally large activation values that
dominate the dynamic range. These outliers pose a key challenge to low-bit
quantization, as they make it difficult to preserve precision for the majority
of values. More importantly, we implement state-of-the-art PTQ methods and
conduct a comprehensive evaluation across multiple task types and model
variants. Our analysis is structured along four key dimensions: bit-width,
quantization method, task category, and model type. Through this
multi-perspective evaluation, we offer practical insights into the quantization
behavior of dLLMs under different configurations. We hope our findings provide
a foundation for future research in efficient dLLM deployment. All code and
experimental setups will be released to support the community.
comment: Technical Report, Work in Progress
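As a concrete illustration of the outlier problem described above (our own toy sketch, not the paper's code), naive symmetric absmax quantization picks one scale for the whole tensor, so a single dominant activation stretches the quantization grid and wipes out precision for the well-behaved majority:

    # Toy demonstration: one activation outlier inflates the absmax scale
    # and degrades precision for all remaining values.
    import numpy as np

    def absmax_quantize(x, n_bits=4):
        """Symmetric absmax quantization to n_bits signed integers."""
        qmax = 2 ** (n_bits - 1) - 1          # e.g. 7 for 4 bits
        scale = np.abs(x).max() / qmax        # one scale for the whole tensor
        q = np.clip(np.round(x / scale), -qmax - 1, qmax)
        return q * scale                      # dequantized values

    rng = np.random.default_rng(0)
    acts = rng.normal(0.0, 1.0, size=4096)    # well-behaved activations
    err_clean = np.abs(absmax_quantize(acts) - acts).mean()

    acts_out = acts.copy()
    acts_out[0] = 80.0                        # a single dominant outlier
    err_out = np.abs(absmax_quantize(acts_out) - acts_out)[1:].mean()

    print(f"mean error without outlier: {err_clean:.4f}")
    print(f"mean error with outlier:    {err_out:.4f}")   # far larger

PTQ methods of the kind the study evaluates (e.g., per-channel scaling or outlier smoothing) are, in essence, different strategies for keeping that single scale from being dictated by the outliers.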
☆ Virtual Community: An Open World for Humans, Robots, and Society
Qinhong Zhou, Hongxin Zhang, Xiangye Lin, Zheyuan Zhang, Yutian Chen, Wenjun Liu, Zunzhe Zhang, Sunli Chen, Lixing Fang, Qiushi Lyu, Xinyu Sun, Jincheng Yang, Zeyuan Wang, Bao Chi Dang, Zhehuan Chen, Daksha Ladia, Jiageng Liu, Chuang Gan
The rapid progress in AI and Robotics may lead to a profound societal
transformation, as humans and robots begin to coexist within shared
communities, introducing both opportunities and challenges. To explore this
future, we present Virtual Community, an open-world platform for humans,
robots, and society, built on a universal physics engine and grounded in
real-world 3D
scenes. With Virtual Community, we aim to study embodied social intelligence at
scale: 1) How robots can intelligently cooperate or compete; 2) How humans
develop social relations and build community; 3) More importantly, how
intelligent robots and humans can coexist in an open world. To support these goals,
Virtual Community features: 1) An open-source multi-agent physics simulator
that supports robots, humans, and their interactions within a society; 2) A
large-scale, real-world aligned community generation pipeline, including vast
outdoor space, diverse indoor scenes, and a community of grounded agents with
rich characters and appearances. Leveraging Virtual Community, we propose two
novel challenges. The Community Planning Challenge evaluates multi-agent
reasoning and planning ability in open-world settings, such as cooperating to
help agents with daily activities and efficiently connecting other agents. The
Community Robot Challenge requires multiple heterogeneous robots to collaborate
in solving complex open-world tasks. We evaluate various baselines on these
tasks and demonstrate the challenges in both high-level open-world task
planning and low-level cooperation controls. We hope that Virtual Community
will unlock further study of human-robot coexistence within open-world
environments.
comment: website https://virtual-community-ai.github.io/
☆ MedResearcher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework
Ailing Yu, Lan Yao, Jingnan Liu, Zhe Chen, Jiajun Yin, Yuan Wang, Xinhao Liao, Zhiling Ye, Ji Li, Yun Yue, Hansong Xiao, Hualei Zhou, Chunxiao Guo, Peng Wei, Jinjie Gu
Recent developments in Large Language Model (LLM)-based agents have shown
impressive capabilities spanning multiple domains, exemplified by deep research
systems that demonstrate superior performance on complex information-seeking
and synthesis tasks. While general-purpose deep research agents have shown
impressive capabilities, they struggle significantly with medical domain
challenges, as evidenced by leading proprietary systems achieving limited
accuracy on complex medical benchmarks. The key limitations are: (1) the model
lacks sufficient dense medical knowledge for clinical reasoning, and (2) the
framework is constrained by the absence of specialized retrieval tools tailored
for medical contexts. We present a medical deep research agent that addresses
these challenges through two core innovations. First, we develop a novel data
synthesis framework using medical knowledge graphs, extracting the longest
chains from subgraphs around rare medical entities to generate complex
multi-hop question-answer pairs. Second, we integrate a custom-built private
medical retrieval engine alongside general-purpose tools, enabling accurate
medical information synthesis. Our approach generates 2100+ diverse
trajectories across 12 medical specialties, each averaging 4.2 tool
interactions. Through a two-stage training paradigm combining supervised
fine-tuning and online reinforcement learning with composite rewards, our
MedResearcher-R1-32B model demonstrates exceptional performance, establishing
new state-of-the-art results on medical benchmarks while maintaining
competitive performance on general deep research tasks. Our work demonstrates
that strategic domain-specific innovations in architecture, tool design, and
training data construction can enable smaller open-source models to outperform
much larger proprietary systems in specialized domains.
comment: 13 pages, 5 figures
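A minimal sketch of the chain-extraction step described above, with invented entities and relations (the paper's knowledge graph, schema, and question writer will differ): take the subgraph around a rare entity, pull out the longest relation chain, and hand that chain to an LLM to phrase as one multi-hop question whose answer is the final node.

    # Hypothetical toy graph; entity and relation names are illustrative only.
    import networkx as nx

    kg = nx.DiGraph()
    kg.add_edge("Fabry disease", "alpha-galactosidase A",
                relation="caused_by_deficiency_of")
    kg.add_edge("alpha-galactosidase A", "GLA gene", relation="encoded_by")
    kg.add_edge("GLA gene", "X chromosome", relation="located_on")

    rare = "Fabry disease"
    sub = kg.subgraph(nx.descendants(kg, rare) | {rare})

    # Longest simple path starting at the rare entity.
    paths = [p for t in sub.nodes if t != rare
             for p in nx.all_simple_paths(sub, rare, t)]
    chain = max(paths, key=len)

    for a, b in zip(chain, chain[1:]):
        print(f"({a}) --{sub.edges[a, b]['relation']}--> ({b})")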
☆ The Prompting Brain: Neurocognitive Markers of Expertise in Guiding Large Language Models
Hend Al-Khalifa, Raneem Almansour, Layan Abdulrahman Alhuasini, Alanood Alsaleh, Mohamad-Hani Temsah, Ashwag Rafea S Alruwaili
Prompt engineering has rapidly emerged as a critical skill for effective
interaction with large language models (LLMs). However, the cognitive and
neural underpinnings of this expertise remain largely unexplored. This paper
presents findings from a cross-sectional pilot fMRI study investigating
differences in brain functional connectivity and network activity between
experts and intermediate prompt engineers. Our results reveal distinct neural
signatures associated with higher prompt engineering literacy, including
increased functional connectivity in brain regions such as the left middle
temporal gyrus and the left frontal pole, as well as altered power-frequency
dynamics in key cognitive networks. These findings offer initial insights into
the neurobiological basis of prompt engineering proficiency. We discuss the
implications of these neurocognitive markers in Natural Language Processing
(NLP). Understanding the neural basis of human expertise in interacting with
LLMs can inform the design of more intuitive human-AI interfaces, contribute to
cognitive models of LLM interaction, and potentially guide the development of
AI systems that better align with human cognitive workflows. This
interdisciplinary approach aims to bridge the gap between human cognition and
machine intelligence, fostering a deeper understanding of how humans learn and
adapt to complex AI systems.
☆ Long Chain-of-Thought Reasoning Across Languages
Scaling inference through long chains-of-thought (CoTs) has unlocked
impressive reasoning capabilities in large language models (LLMs), yet the
reasoning process remains almost exclusively English-centric. We construct
translated versions of two popular English reasoning datasets, fine-tune Qwen
2.5 (7B) and Qwen 3 (8B) models, and present a systematic study of long CoT
generation across French, Japanese, Latvian, and Swahili. Our experiments
reveal three key findings. First, the efficacy of using English as a pivot
language varies by language: it provides no benefit for French, improves
performance when used as the reasoning language for Japanese and Latvian, and
proves insufficient for Swahili where both task comprehension and reasoning
remain poor. Second, extensive multilingual pretraining in Qwen 3 narrows but
does not eliminate the cross-lingual performance gap. A lightweight fine-tune
using only 1k traces still improves performance by over 30% in Swahili. Third,
data quality versus scale trade-offs are language dependent: small, carefully
curated datasets suffice for English and French, whereas larger but noisier
corpora prove more effective for Swahili and Latvian. Together, these results
clarify when and why long CoTs transfer across languages and provide translated
datasets to foster equitable multilingual reasoning research.
comment: Accepted to SCALR @ COLM 2025
☆ Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs
Skatje Myers, Dmitriy Dligach, Timothy A. Miller, Samantha Barr, Yanjun Gao, Matthew Churpek, Anoop Mayampurath, Majid Afshar
Electronic health records (EHRs) are long, noisy, and often redundant, posing
a major challenge for the clinicians who must navigate them. Large language
models (LLMs) offer a promising solution for extracting and reasoning over this
unstructured text, but the length of clinical notes often exceeds even
state-of-the-art models' extended context windows. Retrieval-augmented
generation (RAG) offers an alternative by retrieving task-relevant passages
from across the entire EHR, potentially reducing the number of required input
tokens. In this work, we propose three clinical tasks designed to be replicable
across health systems with minimal effort: 1) extracting imaging procedures, 2)
generating timelines of antibiotic use, and 3) identifying key diagnoses. Using
EHRs from actual hospitalized patients, we test three state-of-the-art LLMs
with varying amounts of provided context, using either targeted text retrieval
or the most recent clinical notes. We find that RAG closely matches or exceeds
the performance of using recent notes, and approaches the performance of using
the models' full context while requiring drastically fewer input tokens. Our
results suggest that RAG remains a competitive and efficient approach even as
newer models become capable of handling increasingly longer amounts of text.
☆ Privileged Self-Access Matters for Introspection in AI
Whether AI models can introspect is an increasingly important practical
question. But there is no consensus on how introspection is to be defined.
Beginning from a recently proposed "lightweight" definition, we argue instead
for a thicker one. According to our proposal, introspection in AI is any
process which yields information about internal states through a process more
reliable than one with equal or lower computational cost available to a third
party. Using experiments where LLMs reason about their internal temperature
parameters, we show they can appear to have lightweight introspection while
failing to meaningfully introspect per our proposed definition.
☆ TransLLM: A Unified Multi-Task Foundation Framework for Urban Transportation via Learnable Prompting
Urban transportation systems encounter diverse challenges across multiple
tasks, such as traffic forecasting, electric vehicle (EV) charging demand
prediction, and taxi dispatch. Existing approaches suffer from two key
limitations: small-scale deep learning models are task-specific and
data-hungry, limiting their generalizability across diverse scenarios, while
large language models (LLMs), despite offering flexibility through natural
language interfaces, struggle with structured spatiotemporal data and numerical
reasoning in transportation domains. To address these limitations, we propose
TransLLM, a unified foundation framework that integrates spatiotemporal
modeling with large language models through learnable prompt composition. Our
approach features a lightweight spatiotemporal encoder that captures complex
dependencies via dilated temporal convolutions and dual-adjacency graph
attention networks, seamlessly interfacing with LLMs through structured
embeddings. A novel instance-level prompt routing mechanism, trained via
reinforcement learning, dynamically personalizes prompts based on input
characteristics, moving beyond fixed task-specific templates. The framework
operates by encoding spatiotemporal patterns into contextual representations,
dynamically composing personalized prompts to guide LLM reasoning, and
projecting the resulting representations through specialized output layers to
generate task-specific predictions. Experiments across seven datasets and three
tasks demonstrate the exceptional effectiveness of TransLLM in both supervised
and zero-shot settings. Compared to ten baseline models, it delivers
competitive performance on both regression and planning problems, showing
strong generalization and cross-task adaptability. Our code is available at
https://github.com/BiYunying/TransLLM.
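A rough PyTorch sketch of the dilated temporal convolution component named above; the layer sizes, causal padding, and stacking are our assumptions, not the TransLLM implementation:

    import torch
    import torch.nn as nn

    class DilatedTemporalEncoder(nn.Module):
        def __init__(self, n_features, hidden=64, dilations=(1, 2, 4)):
            super().__init__()
            layers, in_ch = [], n_features
            for d in dilations:
                # Dilation widens the receptive field without extra depth.
                layers += [nn.Conv1d(in_ch, hidden, kernel_size=2,
                                     dilation=d, padding=d),
                           nn.ReLU()]
                in_ch = hidden
            self.net = nn.Sequential(*layers)

        def forward(self, x):                     # x: (batch, time, features)
            x = x.transpose(1, 2)                 # -> (batch, features, time)
            h = self.net(x)
            return h[..., : x.shape[-1]].transpose(1, 2)  # trim extra padding

    enc = DilatedTemporalEncoder(n_features=8)
    print(enc(torch.randn(4, 24, 8)).shape)       # torch.Size([4, 24, 64])

In the full framework, these encoder states would then be paired with the graph-attention output and projected into the LLM's embedding space as structured prompt inputs.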
☆ Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference
Large language models (LLMs) are increasingly applied in multilingual
contexts, yet their capacity for consistent, logically grounded alignment
across languages remains underexplored. We present a controlled evaluation
framework for multilingual natural language inference (NLI) that generates
synthetic, logic-based premise-hypothesis pairs and translates them into a
typologically diverse set of languages. This design enables precise control
over semantic relations and allows testing in both monolingual and
mixed-language (code-switched) conditions. Surprisingly, code-switching does
not degrade, and can even improve, performance, suggesting that
translation-induced lexical variation may serve as a regularization signal. We
validate semantic preservation through embedding-based similarity analyses and
cross-lingual alignment visualizations, confirming the fidelity of translated
pairs. Our findings expose both the potential and the brittleness of current
LLM cross-lingual reasoning, and identify code-switching as a promising lever
for improving multilingual robustness. Code available at:
https://github.com/KurbanIntelligenceLab/nli-stress-testing
comment: Under review
☆ Transplant Then Regenerate: A New Paradigm for Text Data Augmentation
Data augmentation is a critical technique in deep learning. Traditional
methods like Back-translation typically focus on lexical-level rephrasing,
which primarily produces variations with the same semantics. While large
language models (LLMs) have enhanced text augmentation by their "knowledge
emergence" capability, controlling the style and structure of these outputs
remains challenging and requires meticulous prompt engineering. In this paper,
we propose LMTransplant, a novel text augmentation paradigm leveraging LLMs.
The core idea of LMTransplant is transplant-then-regenerate: incorporating seed
text into a context expanded by LLM, and asking the LLM to regenerate a variant
based on the expanded context. This strategy allows the model to create more
diverse and creative content-level variants by fully leveraging the knowledge
embedded in LLMs, while preserving the core attributes of the original text. We
evaluate LMTransplant across various text-related tasks, demonstrating its
superior performance over existing text augmentation methods. Moreover,
LMTransplant demonstrates exceptional scalability as the size of augmented data
grows.
comment: Accepted by EMNLP 2025
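Schematically, the transplant-then-regenerate loop can be expressed in a few lines; `llm` stands for any text-completion function, and the prompt wording below is our guess, not the paper's templates:

    def lm_transplant(seed_text: str, llm) -> str:
        # Step 1 (transplant): grow context around the seed text.
        expanded = llm(
            "Write a short passage that naturally contains the following "
            f"sentence somewhere in the middle:\n\n{seed_text}"
        )
        # Step 2 (regenerate): produce a variant of the seed conditioned
        # on the expanded context rather than on the seed alone.
        return llm(
            f"{expanded}\n\nRewrite the embedded sentence as a new sentence "
            "that plays the same role in the passage but uses different "
            "wording and structure. Output only the new sentence."
        )

The design intuition is that the surrounding context licenses content-level variation (new entities, angles, or framing) that direct paraphrasing prompts tend to suppress.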
☆ The Digital Sous Chef -- A Comparative Study on Fine-Tuning Language Models for Recipe Generation
We establish a rigorous benchmark for text-based recipe generation, a
fundamental task in natural language generation. We present a comprehensive
comparative study contrasting a fine-tuned GPT-2 large (774M) model against the
GPT-2 small (124M) model and traditional LSTM/RNN baselines on the 5-cuisine
corpus from RecipeDB. Our key contribution is a targeted tokenization strategy
that augments the vocabulary with 23 common fraction tokens and custom
structural markers. This approach addresses a critical limitation of generic
tokenizers by preserving essential recipe structures and precise numerical
quantities, thereby enhancing domain specificity. Performance is evaluated
using a comprehensive suite of seven automatic metrics spanning fluency
(BLEU-4, METEOR), coherence (ROUGE-L), semantic relevance (BERTScore), and
diversity. Our experiments show that the large transformer-based approach
yields a >20% relative improvement in BERTScore (F1) (0.92 vs 0.72) over the
best recurrent baseline, while reducing perplexity by 69.8%. We conclude with a
discussion of remaining challenges, particularly regarding factual accuracy,
and outline how this foundational study paves the way for integrating
real-world constraints and multi-modal inputs in advanced recipe generation
research.
comment: 8 pages, 4 figures. Code is available at:
https://github.com/shubh-iiit/RecipeGPT2-Your-Own-AI-Chef
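The tokenizer augmentation is straightforward to reproduce in outline with Hugging Face transformers; the token inventory below is illustrative, not the paper's exact list of 23 fractions and markers:

    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    fractions = ["1/2", "1/3", "2/3", "1/4", "3/4", "1/8", "3/8"]
    markers = ["<TITLE>", "<INGR>", "<INSTR>", "<NEXT_STEP>"]

    tokenizer.add_tokens(fractions)                # ordinary new tokens
    tokenizer.add_special_tokens({"additional_special_tokens": markers})
    model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix

    ids = tokenizer("<INGR> 3/4 cup sugar <NEXT_STEP>")["input_ids"]
    print(tokenizer.convert_ids_to_tokens(ids))
    # "3/4" now maps to a single id instead of splitting into "3", "/", "4".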
☆ ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine
Junying Chen, Zhenyang Cai, Zhiheng Liu, Yunjin Yang, Rongsheng Wang, Qingying Xiao, Xiangyi Feng, Zhan Su, Jing Guo, Xiang Wan, Guangjun Yu, Haizhou Li, Benyou Wang
Despite the success of large language models (LLMs) in various domains, their
potential in Traditional Chinese Medicine (TCM) remains largely underexplored
due to two critical barriers: (1) the scarcity of high-quality TCM data and (2)
the inherently multimodal nature of TCM diagnostics, which involve looking,
listening, smelling, and pulse-taking. These sensory-rich modalities are beyond
the scope of conventional LLMs. To address these challenges, we present
ShizhenGPT, the first multimodal LLM tailored for TCM. To overcome data
scarcity, we curate the largest TCM dataset to date, comprising 100GB+ of text
and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and
physiological signals. ShizhenGPT is pretrained and instruction-tuned to
achieve deep TCM knowledge and multimodal reasoning. For evaluation, we collect
recent national TCM qualification exams and build a visual benchmark for
Medicinal Recognition and Visual Diagnosis. Experiments demonstrate that
ShizhenGPT outperforms comparable-scale LLMs and competes with larger
proprietary models. Moreover, it leads in TCM visual understanding among
existing multimodal LLMs and demonstrates unified perception across modalities
like sound, pulse, smell, and vision, paving the way toward holistic multimodal
perception and diagnosis in TCM. Datasets, models, and code are publicly
available. We hope this work will inspire further exploration in this field.
☆ MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Junnan Li
The Model Context Protocol has emerged as a transformative standard for
connecting large language models to external data sources and tools, rapidly
gaining adoption across major AI providers and development platforms. However,
existing benchmarks are overly simplistic and fail to capture real application
challenges such as long-horizon reasoning and large, unfamiliar tool spaces. To
address this critical gap, we introduce MCP-Universe, the first comprehensive
benchmark specifically designed to evaluate LLMs in realistic and hard tasks
through interaction with real-world MCP servers. Our benchmark encompasses 6
core domains spanning 11 different MCP servers: Location Navigation, Repository
Management, Financial Analysis, 3D Design, Browser Automation, and Web
Searching. To ensure rigorous evaluation, we implement execution-based
evaluators, including format evaluators for agent format compliance, static
evaluators for time-invariant content matching, and dynamic evaluators that
automatically retrieve real-time ground truth for temporally sensitive tasks.
Through extensive evaluation of leading LLMs, we find that even SOTA models
such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit
significant performance limitations. In addition, our benchmark poses a
significant long-context challenge for LLM agents, as the number of input
tokens increases rapidly with the number of interaction steps. Moreover, it
introduces an unknown-tools challenge, as LLM agents often lack familiarity
with the precise usage of the MCP servers. Notably, enterprise-level agents
like Cursor cannot achieve better performance than standard ReAct frameworks.
Beyond evaluation, we open-source our extensible evaluation framework with UI
support, enabling researchers and practitioners to seamlessly integrate new
agents and MCP servers while fostering innovation in the rapidly evolving MCP
ecosystem.
comment: Website: https://mcp-universe.github.io
☆ Improving in-context learning with a better scoring function
Large language models (LLMs) exhibit a remarkable capacity to learn by
analogy, known as in-context learning (ICL). However, recent studies have
revealed limitations in this ability. In this paper, we examine these
limitations on tasks involving first-order quantifiers such as "all" and
"some", as well as on ICL with linear functions. We identify Softmax, the
scoring function in the attention mechanism, as a contributing factor to these
constraints. To address this, we propose scaled signed averaging (SSA), a
novel alternative to Softmax. Empirical results show that SSA
dramatically improves performance on our target tasks. Furthermore, we evaluate
both encoder-only and decoder-only transformer models with SSA, demonstrating
that they match or exceed their Softmax-based counterparts across a variety of
linguistic probing tasks.
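The abstract names SSA but does not define it, so the snippet below is only one speculative reading (sign-preserving normalization by total absolute score), included to make the contrast with Softmax concrete rather than to reproduce the paper's formula:

    import torch

    def softmax_weights(scores):                 # scores: (..., queries, keys)
        return torch.softmax(scores, dim=-1)     # strictly positive, sum to 1

    def scaled_signed_averaging(scores, eps=1e-9):
        # Keep each score's sign; scale so the absolute weights sum to 1.
        return scores / (scores.abs().sum(dim=-1, keepdim=True) + eps)

    scores = torch.tensor([[2.0, -2.0, 0.5]])
    print(softmax_weights(scores))           # negative evidence nearly vanishes
    print(scaled_signed_averaging(scores))   # negative evidence stays negative

Whatever its exact form, one property such a Softmax replacement can gain is that attention weights may subtract value vectors instead of only mixing them with positive coefficients, which is plausibly useful for quantified statements like "some" and "all".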
☆ Continuous sentiment scores for literary and multilingual contexts
Sentiment Analysis is widely used to quantify sentiment in text, but its
application to literary texts poses unique challenges due to figurative
language, stylistic ambiguity, and sentiment evocation strategies.
Traditional dictionary-based tools often underperform, especially for
low-resource languages, and transformer models, while promising, typically
output coarse categorical labels that limit fine-grained analysis. We introduce
a novel continuous sentiment scoring method based on concept vector projection,
trained on multilingual literary data, which more effectively captures nuanced
sentiment expressions across genres, languages, and historical periods. Our
approach outperforms existing tools on English and Danish texts, producing
sentiment scores whose distribution closely matches human ratings, enabling
more accurate analysis and sentiment arc modeling in literature.
comment: 16 pages after compiling, 3025 words, 6 figures, 5 tables and an
algorithm
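A minimal sketch of sentiment scoring by concept vector projection, assuming an off-the-shelf multilingual sentence encoder and seed word lists (the paper's training data, model, and concept construction will differ):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    pos = ["wonderful", "joyful", "beloved"]
    neg = ["terrible", "miserable", "hated"]

    # Concept axis: mean positive embedding minus mean negative embedding.
    axis = model.encode(pos).mean(axis=0) - model.encode(neg).mean(axis=0)
    axis /= np.linalg.norm(axis)

    def sentiment_score(sentence: str) -> float:
        """Continuous score: cosine projection onto the valence axis."""
        v = model.encode(sentence)
        return float(np.dot(v, axis) / np.linalg.norm(v))

    print(sentiment_score("Hun smilede, og alt var godt."))    # Danish
    print(sentiment_score("The house stood silent and cold."))

Because the score is a continuous projection rather than a class label, it supports the fine-grained analysis and sentiment arc modeling the abstract targets.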
☆ Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek
Southern Uzbek (uzs) is a Turkic language variety spoken by around 5 million
people in Afghanistan and differs significantly from Northern Uzbek (uzn) in
phonology, lexicon, and orthography. Despite the large number of speakers,
Southern Uzbek is underrepresented in natural language processing. We present
new resources for Southern Uzbek machine translation, including a 997-sentence
FLORES+ dev set, 39,994 parallel sentences from dictionary, literary, and web
sources, and a fine-tuned NLLB-200 model (lutfiy). We also propose a
post-processing method for restoring Arabic-script half-space characters, which
improves handling of morphological boundaries. All datasets, models, and tools
are released publicly to support future work on Southern Uzbek and other
low-resource languages.
☆ Towards Skeletal and Signer Noise Reduction in Sign Language Production via Quaternion-Based Pose Encoding and Contrastive Learning
One of the main challenges in neural sign language production (SLP) lies in
the high intra-class variability of signs, arising from signer morphology and
stylistic variety in the training data. To improve robustness to such
variations, we propose two enhancements to the standard Progressive
Transformers (PT) architecture (Saunders et al., 2020). First, we encode poses
using bone rotations in quaternion space and train with a geodesic loss to
improve the accuracy and clarity of angular joint movements. Second, we
introduce a contrastive loss to structure decoder embeddings by semantic
similarity, using either gloss overlap or SBERT-based sentence similarity,
aiming to filter out anatomical and stylistic features that do not convey
relevant semantic information. On the Phoenix14T dataset, the contrastive loss
alone yields a 16% improvement in Probability of Correct Keypoint over the PT
baseline. When combined with quaternion-based pose encoding, the model achieves
a 6% reduction in Mean Bone Angle Error. These results point to the benefit of
incorporating skeletal structure modeling, together with semantically guided
contrastive objectives over sign pose representations, into the training of
Transformer-based SLP models.
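For concreteness, a geodesic loss between unit quaternions can be written as below; this is the generic recipe the abstract names, and the paper's exact formulation may differ:

    import torch

    def quat_geodesic_loss(q_pred, q_true, eps=1e-7):
        """Mean geodesic angle between unit quaternions of shape (..., 4)."""
        q_pred = q_pred / q_pred.norm(dim=-1, keepdim=True)
        q_true = q_true / q_true.norm(dim=-1, keepdim=True)
        # abs() treats q and -q as the same rotation (double cover).
        dot = (q_pred * q_true).sum(dim=-1).abs().clamp(max=1.0 - eps)
        return (2.0 * torch.acos(dot)).mean()     # angle in radians

    q_hat = torch.randn(8, 50, 4)                 # e.g. 8 frames, 50 bones
    q_ref = torch.randn(8, 50, 4)
    print(quat_geodesic_loss(q_hat, q_ref))

Compared with a plain coordinate-space loss on keypoints, this penalizes angular error directly, which is the motivation the abstract gives for clearer angular joint movements.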
☆ Who Sees What? Structured Thought-Action Sequences for Epistemic Reasoning in LLMs
Luca Annese, Sabrina Patania, Silvia Serino, Tom Foulsham, Silvia Rossi, Azzurra Ruggeri, Dimitri Ognibene
Recent advances in large language models (LLMs) and reasoning frameworks have
opened new possibilities for improving the perspective-taking capabilities of
autonomous agents. However, tasks that involve active perception, collaborative
reasoning, and perspective taking (understanding what another agent can see or
knows) pose persistent challenges for current LLM-based systems. This study
investigates the potential of structured examples derived from transformed
solution graphs generated by the Fast Downward planner to improve the
performance of LLM-based agents within a ReAct framework. We propose a
structured solution-processing pipeline that generates three distinct
categories of examples: optimal goal paths (G-type), informative node paths
(E-type), and step-by-step optimal decision sequences contrasting alternative
actions (L-type). These solutions are further converted into "thought-action"
examples by prompting an LLM to explicitly articulate the reasoning behind each
decision. While L-type examples slightly reduce clarification requests and
overall action steps, they do not yield consistent improvements. Agents are
successful in tasks requiring basic attentional filtering but struggle in
scenarios that require mentalising about occluded spaces or weighing the costs
of epistemic actions. These findings suggest that structured examples alone are
insufficient for robust perspective-taking, underscoring the need for explicit
belief tracking, cost modelling, and richer environments to enable socially
grounded collaboration in LLM-based agents.
comment: Accepted at ICSR25
☆ EmoTale: An Enacted Speech-emotion Dataset in Danish
While multiple emotional speech corpora exist for commonly spoken languages,
there is a lack of functional datasets for smaller (spoken) languages, such as
Danish. To our knowledge, Danish Emotional Speech (DES), published in 1997, is
the only other database of Danish emotional speech. We present EmoTale, a
corpus comprising Danish and English speech recordings with their associated
enacted emotion annotations. We demonstrate the validity of the dataset by
investigating and presenting its predictive power using speech emotion
recognition (SER) models. We develop SER models for EmoTale and the reference
datasets using self-supervised speech model (SSLM) embeddings and the openSMILE
feature extractor. We find the embeddings superior to the hand-crafted
features. The best model achieves an unweighted average recall (UAR) of 64.1%
on the EmoTale corpus using leave-one-speaker-out cross-validation, comparable
to the performance on DES.
comment: To appear in the proceedings of ASRU 2025
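For reference, the unweighted average recall (UAR) reported above is per-class recall averaged without class-frequency weighting, so rare emotions count as much as frequent ones; with scikit-learn it is one call:

    from sklearn.metrics import recall_score

    y_true = ["ang", "ang", "sad", "hap", "hap", "hap", "neu"]
    y_pred = ["ang", "sad", "sad", "hap", "neu", "hap", "neu"]

    # average="macro" is exactly UAR: the mean of the per-class recalls.
    print(f"UAR: {recall_score(y_true, y_pred, average='macro'):.3f}")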
☆ Reasoning is about giving reasons
Convincing someone of the truth value of a premise requires understanding and
articulating the core logical structure of the argument which proves or
disproves the premise. Understanding the logical structure of an argument
refers to understanding the underlying "reasons" which make up the proof or
disproof of the premise - as a function of the "logical atoms" in the argument.
While it has been shown that transformers can "chain" rules to derive simple
arguments, the challenge of articulating the "reasons" remains. Not only do
current approaches to chaining rules suffer in terms of their interpretability,
they are also quite constrained in their ability to accommodate extensions to
theoretically equivalent reasoning tasks - a model trained to chain rules
cannot support abduction or identify contradictions. In this work we suggest
addressing these shortcomings by identifying an intermediate representation
(which we call the Representation of the Logical Structure (RLS) of the
argument) that possesses an understanding of the logical structure of a natural
language argument - the logical atoms in the argument and the rules
incorporating them. Given the logical structure, reasoning is deterministic and
easy to compute. Therefore, our approach supports all forms of reasoning that
depend on the logical structure of the natural language argument, including
arbitrary depths of reasoning, on-the-fly mistake rectification and interactive
discussion with respect to an argument. We show that we can identify and
extract the logical structure of natural language arguments in three popular
reasoning datasets with high accuracies, thus supporting explanation generation
and extending the reasoning capabilities significantly.
☆ In2x at WMT25 Translation Task
This paper presents the open-system submission by the In2x research team for
the WMT25 General Machine Translation Shared Task. Our submission focuses on
Japanese-related translation tasks, aiming to explore a generalizable paradigm
for extending large language models (LLMs) to other languages. This paradigm
encompasses aspects such as data construction methods and reward model design.
The ultimate goal is to enable large language model systems to achieve
exceptional performance in low-resource or less commonly spoken languages.
☆ DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang
We present DuPO, a dual learning-based preference optimization framework that
generates annotation-free feedback via a generalized duality. DuPO addresses
two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s
reliance on costly labels and applicability restricted to verifiable tasks, and
traditional dual learning's restriction to strictly dual task pairs (e.g.,
translation and back-translation). Specifically, DuPO decomposes a primal
task's input into known and unknown components, then constructs its dual task
to reconstruct the unknown part using the primal output and known information
(e.g., reversing math solutions to recover hidden variables), broadening
applicability to non-invertible tasks. The quality of this reconstruction
serves as a self-supervised reward to optimize the primal task, synergizing
with LLMs' ability to instantiate both tasks via a single model. Empirically,
DuPO achieves substantial gains across diverse tasks: it enhances the average
translation quality by 2.13 COMET over 756 directions, boosts the mathematical
reasoning accuracy by an average of 6.4 points on three challenge benchmarks,
and enhances performance by 9.3 points as an inference-time reranker (trading
computation for accuracy). These results position DuPO as a scalable, general,
and annotation-free paradigm for LLM optimization.
comment: 18 pages, 4 figures
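The core reward construction can be sketched abstractly with translation as the primal task; `llm` and `similarity` are placeholders for any generator and any quality scorer (chrF, COMET-QE, or embedding cosine), and the prompts are ours, not the paper's:

    def dupo_reward(source: str, llm, similarity) -> float:
        # Primal task: produce the output whose quality we want to score.
        translation = llm(f"Translate to French: {source}")

        # Dual task: reconstruct the unknown component (the source) from
        # the primal output plus the known information (the language pair).
        reconstruction = llm(f"Translate back to English: {translation}")

        # Reconstruction quality is the annotation-free reward for the
        # primal output.
        return similarity(source, reconstruction)

One natural way to turn this signal into preference optimization is to sample several primal outputs and prefer those with higher dual-reconstruction reward.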
☆ NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
NVIDIA, :, Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adi Renduchintala, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav Mandarwal, Arham Mehta, Arun Venkatesan, Ashton Sharabiani, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Banghua Zhu, Barnaby Simkin, Bilal Kartal, Bita Darvish Rouhani, Bobby Chen, Boris Ginsburg, Brandon Norick, Brian Yu, Bryan Catanzaro, Charles Wang, Charlie Truong, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christian Munley, Christopher Parisien, Dan Su, Daniel Afrimi, Daniel Korzekwa, Daniel Rohrer, Daria Gitman, David Mosallanezhad, Deepak Narayanan, Dima Rekesh, Dina Yared, Dmytro Pykhtar, Dong Ahn, Duncan Riach, Eileen Long, Elliott Ning, Eric Chung, Erick Galinkin, Evelina Bakhturina, Gargi Prasad, Gerald Shen, Haim Elisha, Harsh Sharma, Hayley Ross, Helen Ngo, Herman Sahota, Hexin Wang, Hoo Chang Shin, Hua Huang, Iain Cunningham, Igor Gitman, Ivan Moshkov, Jaehun Jung, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jimmy Zhang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jonathan Cohen, Joseph Jennings, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kari Briski, Katherine Cheung, Katherine Luna, Keith Wyss, Keshav Santhanam, Kezhi Kong, Krzysztof Pawelec, Kumar Anik, Kunlun Li, Kushan Ahmadian, Lawrence McAfee, Laya Sleiman, Leon Derczynski, Luis Vega, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski, Mark Cai, Markus Kliegl, Marta Stepniewska-Dziubinska, Matvei Novikov, Mehrzad Samadi, Meredith Price, Meriem Boubdir, Michael Boone, Michael Evans, Michal Bien, Michal Zawalski, Miguel Martinez, Mike Chrzanowski, Mohammad Shoeybi, Mostofa Patwary, Namit Dhameja, Nave Assaf, Negar Habibi, Nidhi Bhatia, Nikki Pope, Nima Tajbakhsh, Nirmal Kumar Juluru, Oleg Rybakov, Oleksii Hrinchuk, Oleksii Kuchaiev, Oluwatobi Olabiyi, Pablo Ribalta, Padmavathy Subramanian, Parth Chadha, Pavlo Molchanov, Peter Dykas, Peter Jin, Piotr Bialecki, Piotr Januszewski, Pradeep Thalasta, Prashant Gaikwad, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Rabeeh Karimi Mahabadi, Rajen Patel, Ran El-Yaniv, Ranjit Rajan, Ria Cheruvu, Rima Shahbazyan, Ritika Borkar, Ritu Gala, Roger Waleffe, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Sahil Jain, Samuel Kriman, Sanjeev Satheesh, Saori Kaji, Sarah Yurick, Saurav Muralidharan, Sean Narenthiran, Seonmyeong Bak, Sepehr Sameni, Seungju Han, Shanmugam Ramasamy, Shaona Ghosh, Sharath Turuvekere Sreenivas, Shelby Thomas, Shizhe Diao, Shreya Gopal, Shrimai Prabhumoye, Shubham Toshniwal, Shuoyang Ding, Siddharth Singh, Siddhartha Jain, Somshubra Majumdar, Stefania Alborghetti, Syeda Nahida Akter, Terry Kong, Tim Moon, Tomasz Hliwiak, Tomer Asida, Tony Wang, Twinkle Vashishth, Tyler Poon, Udi Karpas, Vahid Noroozi, Venkat Srinivasan, Vijay Korthikanti, Vikram Fugro, Vineeth Kalluru, Vitaly Kurin, Vitaly Lavrukhin, Wasi Uddin Ahmad, Wei Du, Wonmin Byeon, Ximing Lu, Xin Dong, Yashaswi Karnati, Yejin Choi, Yian Zhang, Ying Lin, Yonggan Fu, Yoshi Suhara, Zhen Dong, Zhiyu Li, Zhongbo Zhu, Zijia Chen
We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model
designed to increase throughput for reasoning workloads while achieving
state-of-the-art accuracy compared to similarly-sized models.
Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the
majority of the self-attention layers in the common Transformer architecture
are replaced with Mamba-2 layers, to achieve improved inference speed when
generating the long thinking traces needed for reasoning. We create
Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model
(Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe.
After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to
compress and distill the model with the goal of enabling inference on up to
128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision).
Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that
Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks
while achieving up to 6x higher inference throughput in reasoning settings like
8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2,
Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with
the majority of our pre- and post-training datasets on Hugging Face.
☆ Knowledge Graph-Infused Fine-Tuning for Structured Reasoning in Large Language Models
This paper addresses the problems of missing reasoning chains and
insufficient entity-level semantic understanding in large language models when
dealing with tasks that require structured knowledge. It proposes a fine-tuning
algorithm framework based on knowledge graph injection. The method builds on
pretrained language models and introduces structured graph information for
auxiliary learning. A graph neural network is used to encode entities and their
relations, constructing a graph-based semantic representation. A fusion
mechanism is then designed to jointly model the knowledge graph embeddings with
the contextual representations from the language model. To enhance the
robustness of knowledge integration, a gating mechanism is introduced to
dynamically balance the contributions of linguistic semantics and structural
knowledge. This effectively mitigates conflicts between different
representational spaces. During training, a joint loss function is constructed
to account for both task performance and structural alignment objectives. This
helps improve the accuracy of entity prediction and semantic reasoning. The
study also includes a series of systematic sensitivity experiments. It
evaluates the effects of learning rate, graph coverage, and structural
perturbations on model performance. The results further validate the
effectiveness and stability of the proposed method across tasks such as entity
recognition, question answering, and language generation. Experimental findings
show that the proposed structure-aware fine-tuning framework significantly
enhances the model's ability to represent complex semantic units. It
demonstrates better semantic consistency and contextual logic modeling in
scenarios involving structural reasoning and entity extraction.
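A compact sketch of the gating mechanism described above: a learned gate decides, per token and per dimension, how much graph-encoded knowledge to mix into the language-model representation (dimensions and wiring are our assumptions):

    import torch
    import torch.nn as nn

    class GatedKGFusion(nn.Module):
        def __init__(self, lm_dim=768, kg_dim=200):
            super().__init__()
            self.proj = nn.Linear(kg_dim, lm_dim)      # align KG to LM space
            self.gate = nn.Linear(2 * lm_dim, lm_dim)  # gate sees both views

        def forward(self, h_lm, h_kg):
            # h_lm: (batch, seq, lm_dim) contextual token states
            # h_kg: (batch, seq, kg_dim) linked-entity embeddings (zeros if none)
            h_kg = self.proj(h_kg)
            g = torch.sigmoid(self.gate(torch.cat([h_lm, h_kg], dim=-1)))
            return g * h_lm + (1.0 - g) * h_kg         # convex per-dim mix

    fusion = GatedKGFusion()
    print(fusion(torch.randn(2, 16, 768), torch.randn(2, 16, 200)).shape)

The sigmoid gate is what lets the model fall back to purely linguistic representations when the structural signal conflicts with context, the conflict-mitigation role the abstract assigns to it.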
☆ Cognitive Surgery: The Awakening of Implicit Territorial Awareness in LLMs
Large language models (LLMs) have been shown to possess a degree of
self-recognition capability-the ability to identify whether a given text was
generated by themselves. Prior work has demonstrated that this capability is
reliably expressed under the Pair Presentation Paradigm (PPP), where the model
is presented with two texts and asked to choose which one it authored. However,
performance deteriorates sharply under the Individual Presentation Paradigm
(IPP), where the model is given a single text to judge authorship. Although
this phenomenon has been observed, its underlying causes have not been
systematically analyzed. In this paper, we first replicate existing findings to
confirm that LLMs struggle to distinguish self- from other-generated text under
IPP. We then investigate the reasons for this failure and attribute it to a
phenomenon we term Implicit Territorial Awareness (ITA): the model's latent
ability to distinguish self- and other-texts in representational space, which
remains unexpressed in its output behavior. To awaken the ITA of LLMs, we
propose Cognitive Surgery (CoSur), a novel framework comprising four main
modules: representation extraction, territory construction, authorship
discrimination and cognitive editing. Experimental results demonstrate that our
proposed method improves the performance of three different LLMs in the IPP
scenario, achieving average accuracies of 83.25%, 66.19%, and 88.01%,
respectively.
☆ DEPTH: Hallucination-Free Relation Extraction via Dependency-Aware Sentence Simplification and Two-tiered Hierarchical Refinement
Relation extraction enables the construction of structured knowledge for many
downstream applications. While large language models (LLMs) have shown great
promise in this domain, most existing methods concentrate on relation
classification, which predicts the semantic relation type between a related
entity pair. However, we observe that LLMs often struggle to reliably determine
whether a relation exists, especially in cases involving complex sentence
structures or intricate semantics, which leads to spurious predictions. Such
hallucinations can introduce noisy edges in knowledge graphs, compromising the
integrity of structured knowledge and downstream reliability. To address these
challenges, we propose DEPTH, a framework that integrates Dependency-aware
sEntence simPlification and Two-tiered Hierarchical refinement into the
relation extraction pipeline. Given a sentence and its candidate entity pairs,
DEPTH operates in two stages: (1) the Grounding module extracts relations for
each pair by leveraging their shortest dependency path, distilling the sentence
into a minimal yet coherent relational context that reduces syntactic noise
while preserving key semantics; (2) the Refinement module aggregates all local
predictions and revises them based on a holistic understanding of the sentence,
correcting omissions and inconsistencies. We further introduce a
causality-driven reward model that mitigates reward hacking by disentangling
spurious correlations, enabling robust fine-tuning via reinforcement learning
with human feedback. Experiments on six benchmarks demonstrate that DEPTH
reduces the average hallucination rate to 7.0% while achieving a 17.2%
improvement in average F1 score over state-of-the-art baselines.
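The shortest-dependency-path reduction in the Grounding module is a classic relation-extraction idea; a small sketch with spaCy and networkx (assuming the en_core_web_sm model is installed):

    import networkx as nx
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The company, founded decades ago in Berlin, acquired the startup.")

    # Undirected graph over the dependency tree.
    g = nx.Graph((tok.i, child.i) for tok in doc for child in tok.children)

    def shortest_dep_path(graph, doc, a, b):
        return [doc[i].text for i in nx.shortest_path(graph, a.i, b.i)]

    company = next(t for t in doc if t.text == "company")
    startup = next(t for t in doc if t.text == "startup")
    print(shortest_dep_path(g, doc, company, startup))
    # e.g. ['company', 'acquired', 'startup']: the relational core, with
    # the parenthetical about Berlin stripped away.

Distilling the sentence to this path is what lets the extractor see the entity pair's relation without the surrounding syntactic noise.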
☆ Credence Calibration Game? Calibrating Large Language Models through Structured Play
As Large Language Models (LLMs) are increasingly deployed in
decision-critical domains, it becomes essential to ensure that their confidence
estimates faithfully correspond to their actual correctness. Existing
calibration methods have primarily focused on post-hoc adjustments or auxiliary
model training; however, many of these approaches necessitate additional
supervision or parameter updates. In this work, we propose a novel prompt-based
calibration framework inspired by the Credence Calibration Game. Our method
establishes a structured interaction loop wherein LLMs receive feedback based
on the alignment of their predicted confidence with correctness. Through
feedback-driven prompting and natural language summaries of prior performance,
our framework dynamically improves model calibration. Extensive experiments
across models and game configurations demonstrate consistent improvements in
evaluation metrics. Our results highlight the potential of game-based prompting
as an effective strategy for LLM calibration. Code and data are available at
https://anonymous.4open.science/r/LLM-Calibration/.
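In outline, one pass of such a game is a feedback loop like the following; the prompt wording and Brier-score summary are our choices, and `llm` is assumed to return a parsed (answer, confidence) pair:

    def calibration_game(questions, answers, llm):
        history = []
        for q, gold in zip(questions, answers):
            summary = "; ".join(
                f"said {c:.0%}, was {'right' if ok else 'wrong'}"
                for c, ok in history[-5:]
            ) or "no history yet"
            prompt = (
                f"Your recent calibration: {summary}.\n"
                f"Question: {q}\n"
                "Answer, then state your confidence in [0, 1]."
            )
            answer, confidence = llm(prompt)   # assumed parsed output
            history.append((confidence, answer == gold))
        # Brier score: mean squared gap between confidence and correctness.
        return sum((c - ok) ** 2 for c, ok in history) / len(history)

The point of the loop is that calibration pressure arrives through the prompt itself, with no parameter updates or auxiliary model.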
☆ ZPD-SCA: Unveiling the Blind Spots of LLMs in Assessing Students' Cognitive Abilities
Wenhan Dong, Zhen Sun, Yuemeng Zhao, Zifan Peng, Jun Wu, Jingyi Zheng, Yule Liu, Xinlei He, Yu Wang, Ruiming Wang, Xinyi Huang, Lei Mo
Large language models (LLMs) have demonstrated potential in educational
applications, yet their capacity to accurately assess the cognitive alignment
of reading materials with students' developmental stages remains insufficiently
explored. This gap is particularly critical given the foundational educational
principle of the Zone of Proximal Development (ZPD), which emphasizes the need
to match learning resources with Students' Cognitive Abilities (SCA). Despite
the importance of this alignment, there is a notable absence of comprehensive
studies investigating LLMs' ability to evaluate reading comprehension
difficulty across different student age groups, especially in the context of
Chinese language education. To fill this gap, we introduce ZPD-SCA, a novel
benchmark specifically designed to assess stage-level Chinese reading
comprehension difficulty. The benchmark is annotated by 60 Special Grade
teachers, a group that represents the top 0.15% of all in-service teachers
nationwide. Experimental results reveal that LLMs perform poorly in zero-shot
learning scenarios, with Qwen-max and GLM even falling below the accuracy of
random guessing. When provided with in-context examples, LLMs' performance
improves substantially, with some models achieving nearly double the accuracy
of their zero-shot baselines. These results reveal that LLMs possess emerging
abilities to assess reading difficulty, while also exposing limitations in
their current training for educationally aligned judgment. Notably, even the
best-performing models display systematic directional biases, suggesting
difficulties in accurately aligning material difficulty with SCA. Furthermore,
significant variations in model performance across different genres underscore
the complexity of the task. We envision that ZPD-SCA can provide a foundation for
evaluating and improving LLMs in cognitively aligned educational applications.
☆ ISCA: A Framework for Interview-Style Conversational Agents
Charles Welch, Allison Lahnala, Vasudha Varadarajan, Lucie Flek, Rada Mihalcea, J. Lomax Boyd, João Sedoc
We present a low-compute non-generative system for implementing
interview-style conversational agents which can be used to facilitate
qualitative data collection through controlled interactions and quantitative
analysis. Use cases include applications to tracking attitude formation or
behavior change, where control or standardization over the conversational flow
is desired. We show how our system can be easily adjusted through an online
administrative panel to create new interviews, making the tool accessible
without coding. Two case studies are presented as example applications, one
regarding the Expressive Interviewing system for COVID-19 and the other a
semi-structured interview to survey public opinion on emerging neurotechnology.
Our code is open-source, allowing others to build off of our work and develop
extensions for additional functionality.
☆ Beyond Semantic Similarity: Reducing Unnecessary API Calls via Behavior-Aligned Retriever
Tool-augmented large language models (LLMs) leverage external functions to
extend their capabilities, but inaccurate function calls can lead to
inefficiencies and increased costs. Existing methods address this challenge by
fine-tuning LLMs or using demonstration-based prompting, yet they often suffer
from high training overhead and fail to account for inconsistent demonstration
samples, which misguide the model's invocation behavior. In this paper, we
train a behavior-aligned retriever (BAR), which provides behaviorally
consistent demonstrations to help LLMs make more accurate tool-using decisions.
To train the BAR, we construct a corpus including different function-calling
behaviors, i.e., calling or non-calling. We use the contrastive learning
framework to train the BAR with customized positive/negative pairs and a
dual-negative contrastive loss, ensuring robust retrieval of behaviorally
consistent examples. Experiments demonstrate that our approach significantly
reduces erroneous function calls while maintaining high task performance,
offering a cost-effective and efficient solution for tool-augmented LLMs.
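One way to realize the dual-negative contrastive loss the abstract mentions is an InfoNCE-style objective in which each query has a behavior-consistent positive and two kinds of negatives, with the behavior-inconsistent (hard) negatives emphasized; the exact form below is our assumption:

    import torch
    import torch.nn.functional as F

    def dual_negative_loss(q, pos, neg_behavior, neg_random, tau=0.07, w=2.0):
        """q, pos, neg_*: (batch, dim). Behavior negatives are up-weighted."""
        q, pos, nb, nr = (F.normalize(t, dim=-1)
                          for t in (q, pos, neg_behavior, neg_random))
        s_pos = (q * pos).sum(-1, keepdim=True) / tau
        s_nb = w * (q * nb).sum(-1, keepdim=True) / tau   # harder penalty
        s_nr = (q * nr).sum(-1, keepdim=True) / tau
        logits = torch.cat([s_pos, s_nb, s_nr], dim=-1)   # (batch, 3)
        # Positive sits at index 0: cross-entropy over the candidates.
        return F.cross_entropy(logits,
                               torch.zeros(q.size(0), dtype=torch.long))

    loss = dual_negative_loss(*[torch.randn(8, 256) for _ in range(4)])
    print(loss)

Training against behavior-inconsistent negatives is what separates "semantically similar" from "leads to the same call/no-call decision", which is the retriever's actual objective.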
☆ SurveyGen-I: Consistent Scientific Survey Generation with Evolving Plans and Memory-Guided Writing
Survey papers play a critical role in scientific communication by
consolidating progress across a field. Recent advances in Large Language Models
(LLMs) offer a promising solution by automating key steps in the
survey-generation pipeline, such as retrieval, structuring, and summarization.
However, existing LLM-based approaches often struggle with maintaining
coherence across long, multi-section surveys and providing comprehensive
citation coverage. To address these limitations, we introduce SurveyGen-I, an
automatic survey generation framework that combines coarse-to-fine retrieval,
adaptive planning, and memory-guided generation. SurveyGen-I first performs
survey-level retrieval to construct the initial outline and writing plan, and
then dynamically refines both during generation through a memory mechanism that
stores previously written content and terminology, ensuring coherence across
subsections. When the system detects insufficient context, it triggers
fine-grained subsection-level retrieval.
Experiments across four scientific domains demonstrate that SurveyGen-I
consistently outperforms previous works in content quality, consistency, and
citation coverage.
comment: The code is available at https://github.com/SurveyGens/SurveyGen-I ,
20 pages, 16 figures
♻ ☆ RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
We investigate to what extent Multimodal Large Language Models (MLLMs) can
accurately identify the orientation of input images rotated 0°, 90°, 180°, and
270°. This task demands robust visual reasoning
capabilities to detect rotational cues and contextualize spatial relationships
within images, regardless of their orientation. To evaluate MLLMs on these
abilities, we introduce RotBench, a 350-image manually filtered benchmark
comprising lifestyle, portrait, and landscape images. Despite the relatively
simple nature of this task, we show that several state-of-the-art open and
proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably
identify rotation in input images. Providing models with auxiliary information
(captions, depth maps, and more) or using chain-of-thought
prompting offers only small and inconsistent improvements. Our results indicate
that most models are able to reliably identify right-side-up (0°) images,
while certain models are able to identify upside-down (180°) images. None
can reliably distinguish between 90° and 270°. Simultaneously showing
the image rotated in different orientations leads to moderate performance gains
for reasoning models, while a modified setup using voting improves the
performance of weaker models. We further show that fine-tuning does not improve
models' ability to distinguish 90° and 270° rotations, despite
substantially improving the identification of 180° images. Together, these
results reveal a significant gap between MLLMs' spatial reasoning capabilities
and human perception in identifying rotation.
comment: 20 pages. Code and data: https://github.com/tianyiniu/RotBench
♻ ☆ Task-Oriented Automatic Fact-Checking with Frame-Semantics
We propose a novel paradigm for automatic fact-checking that leverages frame
semantics to enhance the structured understanding of claims and guide the
process of fact-checking them. To support this, we introduce a pilot dataset of
real-world claims extracted from PolitiFact, specifically annotated for
large-scale structured data. This dataset underpins two case studies: the first
investigates voting-related claims using the Vote semantic frame, while the
second explores various semantic frames based on data sources from the
Organisation for Economic Co-operation and Development (OECD). Our findings
demonstrate the effectiveness of frame semantics in improving evidence
retrieval and explainability for fact-checking. Finally, we conducted a survey
of frames evoked in fact-checked claims, identifying high-impact frames to
guide future work in this direction.
♻ ☆ Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli
Synthetic data generation has recently emerged as a promising approach for
enhancing the capabilities of large language models (LLMs) without the need for
expensive human annotations. However, existing methods often generate data that
can be low quality or contrived. In this paper, we introduce Source2Synth, a
scalable approach for synthetic data generation and curation that is grounded
in real-world data sources. Source2Synth takes as input a custom data source
and produces synthetic data examples with intermediate reasoning steps. Our
method improves the dataset quality by discarding low-quality generations based
on their answerability. We demonstrate the generality of this approach by
applying it to two tasks that leverage two different types of data: multi-hop
question answering (MHQA), where we test complex reasoning abilities leveraging
documents, and tabular question answering (TQA), where we test tool usage
leveraging tables. Our method improves performance by 25.51% for TQA on WikiSQL
and 22.57% for MHQA on HotpotQA compared to the fine-tuned baselines.
♻ ☆ JudgeLRM: Large Reasoning Models as a Judge
The rise of Large Language Models (LLMs) as evaluators offers a scalable
alternative to human annotation, yet existing Supervised Fine-Tuning (SFT)
approaches for judges often fall short in domains requiring complex reasoning. In
this work, we investigate whether LLM judges truly benefit from enhanced
reasoning capabilities. Through a detailed analysis of reasoning requirements
across evaluation tasks, we reveal a negative correlation between SFT
performance gains and the proportion of reasoning-demanding samples,
highlighting the limitations of SFT in such scenarios. To address this, we
introduce JudgeLRM, a family of judgment-oriented LLMs trained using
reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM
models consistently outperform both SFT-tuned and state-of-the-art reasoning
models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms
DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks
requiring deep reasoning.
comment: Preprint
♻ ☆ TASER: Table Agents for Schema-guided Extraction and Recommendation
Real-world financial documents report essential information about an entity's
financial holdings that can span millions of different financial instrument
types. Yet, these details are often buried in messy, multi-page, fragmented
tables; for example, 99.4% of the tables in our dataset have no bounding
boxes, and row counts reach up to 426 per table across 44 pages. To
tackle these unique challenges from real-world tables, we present a
continuously learning, agentic table extraction system, TASER (Table Agents for
Schema-guided Extraction and Recommendation) that extracts highly unstructured,
multi-page, heterogeneous tables into normalized, schema-conforming outputs.
Our table agents execute on table detection, classification, extraction, and
recommendations by leveraging an initial schema. Then, our Recommender Agent
reviews the outputs, recommends schema revisions, and decides on the final
recommendations, enabling TASER to outperform existing table detection models
such as Table Transformer by 10.1%. Within this continuous learning process, we
highlight that larger batch sizes result in a 104.3% increase in schema
recommendations that are actionable and utilized, resulting in a 9.8% increase
in extracted holdings, highlighting the importance of a continuous learning
process. To train TASER, we have manually labeled 22,584 pages (28,150,449
tokens), 3,213 tables for $731,685,511,687 of holdings culminating in one of
the first real financial table datasets. We release our dataset TASERTab to
enable the research community to access real-world financial tables and
outputs. Our results highlight the promise of agentic, schema-guided extraction
systems for robust understanding of real-world financial tables.
comment: Withdrawn due to missing key sections in the paper
♻ ☆ G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model
Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong
Large language models (LLMs) have shown remarkable proficiency in human-level
reasoning and generation capabilities, which encourages extensive research on
their application in mathematical problem solving. However, current work has
been largely focused on text-based mathematical problems, with limited
investigation in problems involving geometric information. Addressing this gap,
we aim to enable LLMs to solve geometric problems by understanding image input.
We first analyze the limitations of current Multimodal Large Language Models
(MLLMs) in this area: they struggle to accurately comprehend basic geometric
elements and their relationships. To overcome these challenges, we take
advantage of the unique characteristics of geometric problems (such as unique
geometric logical form and geometric scalability) and the capacity of the
textual LLMs to build an enriched multimodal geometry dataset based on existing
data. The augmented dataset, Geo170K, contains more than 170K geometric
image-caption and question-answer pairs. Utilizing our constructed Geo170K
dataset, we develop G-LLaVA, which demonstrates exceptional performance in
solving geometric problems, significantly outperforming GPT-4-V on the
MathVista benchmark with only 7B parameters.
comment: 10 pages
♻ ☆ Coupling without Communication and Drafter-Invariant Speculative Decoding
Suppose Alice has a distribution $P$ and Bob has a distribution $Q$. Alice
wants to draw a sample $a\sim P$ and Bob a sample $b \sim Q$ such that $a = b$
with as high of probability as possible. It is well-known that, by sampling
from an optimal coupling between the distributions, Alice and Bob can achieve
$\Pr[a = b] = 1 - D_{TV}(P,Q)$, where $D_{TV}(P,Q)$ is the total variation
distance between $P$ and $Q$. What if Alice and Bob must solve this same
problem \emph{without communicating at all?} Perhaps surprisingly, with access
to public randomness, they can still achieve $\Pr[a = b] \geq \frac{1 -
D_{TV}(P,Q)}{1 + D_{TV}(P,Q)} \geq 1-2D_{TV}(P,Q)$ using a simple protocol
based on the Weighted MinHash algorithm. This bound was shown to be optimal in
the worst-case by [Bavarian et al., 2020]. In this work, we revisit the
communication-free coupling problem. We provide a simpler proof of the
optimality result from [Bavarian et al., 2020]. We show that, while the
worst-case success probability of Weighted MinHash cannot be improved, an
equally simple protocol based on Gumbel sampling offers a Pareto improvement:
for every pair of distributions $P, Q$, Gumbel sampling achieves an equal or
higher value of $\Pr[a = b]$ than Weighted MinHash. Importantly, this
improvement translates to practice. We demonstrate an application of
communication-free coupling to \emph{speculative decoding}, a recent method for
accelerating autoregressive large language models [Leviathan, Kalman, Matias,
ICML 2023]. We show that communication-free protocols can be used to construct
\emph{drafter-invariant speculative decoding} schemes, which have the desirable property that their output is
fixed given a fixed random seed, regardless of what drafter is used for
speculation. In experiments on a language generation task, Gumbel sampling
outperforms Weighted MinHash. Code is available at
https://github.com/majid-daliri/DISD.
comment: 18 pages
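The Gumbel-based protocol is short enough to state in full. A sketch under the abstract's setup: Alice and Bob share only a public random seed, and each computes argmax(log p + Gumbel noise) over a common support (the distributions P and Q below are illustrative):

```python
import numpy as np

def gumbel_coupled_sample(probs: np.ndarray, seed: int) -> int:
    """Sample from `probs` using only public randomness (the seed)."""
    rng = np.random.default_rng(seed)      # shared public randomness
    noise = rng.gumbel(size=probs.shape)   # identical noise on both sides
    with np.errstate(divide="ignore"):     # log(0) -> -inf is harmless here
        return int(np.argmax(np.log(probs) + noise))

P = np.array([0.5, 0.3, 0.2])  # Alice's distribution
Q = np.array([0.4, 0.4, 0.2])  # Bob's distribution
matches = sum(gumbel_coupled_sample(P, s) == gumbel_coupled_sample(Q, s)
              for s in range(10_000))
print(matches / 10_000)  # empirically at or above (1 - TV)/(1 + TV)
```

The Gumbel-max trick guarantees each party's output is a faithful sample from their own distribution, so the marginals are exact while the shared noise maximizes agreement.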
♻ ☆ Boosting Chart-to-Code Generation in MLLM via Dual Preference-Guided Refinement ACM MM 2025
Translating chart images into executable plotting scripts-referred to as the
chart-to-code generation task-requires Multimodal Large Language Models (MLLMs)
to perform fine-grained visual parsing, precise code synthesis, and robust
cross-modal reasoning. However, this task is inherently under-constrained:
multiple valid code implementations can produce the same visual chart, and
evaluation must consider both code correctness and visual fidelity across
diverse dimensions. This makes it difficult to learn accurate and generalizable
mappings through standard supervised fine-tuning. To address these challenges,
we propose a dual preference-guided refinement framework that combines a
feedback-driven, dual-modality reward mechanism with iterative preference
learning. Our approach introduces a structured variant generation strategy and
a visual reward model to efficiently produce high-quality, aspect-aware
preference pairs-making preference collection scalable and supervision more
targeted. These preferences are used in an offline reinforcement learning setup
to optimize the model toward multi-dimensional fidelity. Experimental results
show that our framework significantly enhances the performance of
general-purpose open-source MLLMs, enabling them to generate high-quality
plotting code that rivals specialized chart-centric models and even some
proprietary systems. The code and datasets are publicly available at
https://github.com/Zhihan72/Chart2Code.
comment: Accepted by ACM MM 2025
♻ ☆ Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers
Hallucinations are a persistent problem with Large Language Models (LLMs). As
these models become increasingly used in high-stakes domains, such as
healthcare and finance, the need for effective hallucination detection is
crucial. To this end, we outline a versatile framework for zero-resource
hallucination detection that practitioners can apply to real-world use cases.
To achieve this, we adapt a variety of existing uncertainty quantification (UQ)
techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge,
transforming them as necessary into standardized response-level confidence
scores ranging from 0 to 1. To enhance flexibility, we propose a tunable
ensemble approach that incorporates any combination of the individual
confidence scores. This approach enables practitioners to optimize the ensemble
for a specific use case for improved performance. To streamline implementation,
the full suite of scorers is offered in this paper's companion Python toolkit,
UQLM. To evaluate the performance of the various scorers, we conduct an
extensive set of experiments using several LLM question-answering benchmarks.
We find that our tunable ensemble typically surpasses its individual components
and outperforms existing hallucination detection methods. Our results
demonstrate the benefits of customized hallucination detection strategies for
improving the accuracy and reliability of LLMs.
comment: UQLM repository: https://github.com/cvs-health/uqlm
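The tunable ensemble reduces to a weighted combination of per-scorer confidence scores, with weights fit on labeled examples. A minimal two-scorer sketch on toy data; the grid search is an illustrative stand-in, not UQLM's actual tuning procedure:

```python
import numpy as np

def ensemble_score(component_scores: np.ndarray, weights: np.ndarray) -> np.ndarray:
    w = weights / weights.sum()        # normalize to a convex combination
    return component_scores @ w        # (n_responses, n_scorers) @ (n_scorers,)

def tune_weights(scores: np.ndarray, labels: np.ndarray, grid: int = 11) -> np.ndarray:
    best_w, best_acc = None, -1.0
    for w1 in np.linspace(0, 1, grid):  # two-scorer example
        w = np.array([w1, 1 - w1])
        acc = ((ensemble_score(scores, w) > 0.5) == labels).mean()
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w

# toy confidence scores from two scorers, plus correctness labels
scores = np.array([[0.9, 0.7], [0.2, 0.4], [0.8, 0.9], [0.3, 0.1]])
labels = np.array([True, False, True, False])
print(tune_weights(scores, labels))
```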
♻ ☆ Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models
Clinical summarization is crucial in healthcare as it distills complex
medical data into digestible information, enhancing patient understanding and
care management. Large language models (LLMs) have shown significant potential
in automating and improving the accuracy of such summarizations due to their
advanced natural language understanding capabilities. These models are
particularly applicable in the context of summarizing medical/clinical texts,
where precise and concise information transfer is essential. In this paper, we
investigate the effectiveness of open-source LLMs in extracting key events from
discharge reports, including admission reasons, major in-hospital events, and
critical follow-up actions. In addition, we also assess the prevalence of
various types of hallucinations in the summaries produced by these models.
Detecting hallucinations is vital as it directly influences the reliability of
the information, potentially affecting patient care and treatment outcomes. We
conduct comprehensive simulations to rigorously evaluate the performance of
these models, further probing the accuracy and fidelity of the extracted
content in clinical summarization. Our results reveal that while the LLMs
(e.g., Qwen2.5 and DeepSeek-v2) perform quite well in capturing admission
reasons and hospitalization events, they are generally less consistent when it
comes to identifying follow-up recommendations, highlighting broader challenges
in leveraging LLMs for comprehensive summarization.
♻ ☆ From Autonomy to Agency: Agentic Vehicles for Human-Centered Mobility Systems
Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity
to operate according to internal rules without external control. Accordingly,
autonomous vehicles (AuVs) are viewed as vehicular systems capable of
perceiving their environment and executing pre-programmed tasks independently
of external input. However, both research and real-world deployments
increasingly showcase vehicles whose behaviors go beyond this definition (and
beyond SAE levels 0 to 5): examples include natural-language interaction with
humans, goal adaptation, contextual reasoning, external tool use, and the
handling of unseen ethical dilemmas, capabilities largely empowered by
multi-modal large language models (LLMs). These developments
reveal a conceptual gap between technical autonomy and the broader cognitive
and social capabilities needed for future human-centered mobility systems. To
address this gap, this paper introduces the concept of agentic vehicles (AgVs),
referring to vehicles that integrate agentic AI systems to reason, adapt, and
interact within complex environments. The paper defines the term AgV and
identifies the characteristics that distinguish AgVs from conventional AuVs. It synthesizes
relevant advances in integrating LLMs and AuVs and highlights how AgVs might
transform future mobility systems and ensure the systems are human-centered.
The paper concludes by identifying key challenges in the development and
governance of AgVs, and how they can play a significant role in future agentic
transportation systems.
♻ ☆ Is neural semantic parsing good at ellipsis resolution, or isn't it?
Neural semantic parsers have shown good overall performance for a variety of
linguistic phenomena, reaching semantic matching scores of more than 90%. But
how do such parsers perform on strongly context-sensitive phenomena, where
large pieces of semantic information need to be duplicated to form a meaningful
semantic representation? A case in point is English verb phrase ellipsis, a
construct where entire verb phrases can be abbreviated by a single auxiliary
verb. Are these otherwise powerful semantic parsers able to deal with
ellipsis, or aren't they? We constructed a corpus of 120 cases of ellipsis with
their fully resolved meaning representation and used this as a challenge set
for a large battery of neural semantic parsers. Although these parsers
performed very well on the standard test set, they failed in the instances with
ellipsis. Data augmentation helped improve the parsing results. The reason
elided phrases are difficult to parse is not that copying semantic material is
hard, but that such phrases usually occur in linguistically complicated
contexts, and it is these contexts that cause most of the parsing errors.
comment: Accepted by 16th IWCS
♻ ☆ Retrieval-Augmented Semantic Parsing: Improving Generalization with Lexical Knowledge
Open-domain semantic parsing remains a challenging task, as neural models
often rely on heuristics and struggle to handle unseen concepts. In this paper,
we investigate the potential of large language models (LLMs) for this task and
introduce Retrieval-Augmented Semantic Parsing (RASP), a simple yet effective
approach that integrates external symbolic knowledge into the parsing process.
Our experiments not only show that LLMs outperform previous encoder-decoder
baselines for semantic parsing, but that RASP further enhances their ability to
predict unseen concepts, nearly doubling the performance of previous models on
out-of-distribution concepts. These findings highlight the promise of
leveraging large language models and retrieval mechanisms for robust and
open-domain semantic parsing.
comment: Accepted by 16th IWCS
♻ ☆ Applying Text Embedding Models for Efficient Analysis in Labeled Property Graphs
Labeled property graphs often contain rich textual attributes that can
enhance analytical tasks when properly leveraged. This work explores the use of
pretrained text embedding models to enable efficient semantic analysis in such
graphs. By embedding textual node and edge properties, we support downstream
tasks including node classification and relation prediction with improved
contextual understanding. Our approach integrates language model embeddings
into the graph pipeline without altering its structure, demonstrating that
textual semantics can significantly enhance the accuracy and interpretability
of property graph analysis.
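A minimal sketch of the pipeline idea: embed each node's textual property and compare the vectors without touching the graph structure. The hashed bag-of-words "encoder" below is only a placeholder for a real pretrained text embedding model:

```python
import numpy as np

def embed_text(text: str, dim: int = 32) -> np.ndarray:
    # hashed bag-of-words stand-in for a pretrained encoder
    # (note: Python's str hash is randomized per process)
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

nodes = {"n1": "senior software engineer", "n2": "software developer",
         "n3": "cardiologist"}
vecs = {k: embed_text(v) for k, v in nodes.items()}
# cosine similarity now reflects the textual semantics of node properties
print(vecs["n1"] @ vecs["n2"], vecs["n1"] @ vecs["n3"])
```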
♻ ☆ STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transition Samples AAAI 2026
Evaluating large language models (LLMs) has become increasingly challenging
as model capabilities advance rapidly. While recent models often achieve higher
scores on standard benchmarks, these improvements do not consistently reflect
enhanced real-world reasoning capabilities. Moreover, widespread overfitting to
public benchmarks and the high computational cost of full evaluations have made
it both expensive and less effective to distinguish meaningful differences
between models. To address these challenges, we propose the \textbf{S}tructured
\textbf{T}ransition \textbf{E}valuation \textbf{M}ethod (STEM), a lightweight
and interpretable evaluation framework for efficiently estimating the relative
capabilities of LLMs. STEM identifies \textit{significant transition samples}
(STS) by analyzing consistent performance transitions among LLMs of the same
architecture but varying parameter scales. These samples enable STEM to
effectively estimate the capability position of an unknown model. The Qwen3
model family is used to construct the STS pool on six diverse and
representative benchmarks to assess generalizability. Experimental results
indicate that STEM reliably captures performance trends and aligns with
ground-truth rankings of model capability. These findings highlight STEM as a practical and scalable
method for fine-grained, architecture-agnostic evaluation of LLMs.
comment: Submit to AAAI 2026
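The selection of transition samples can be phrased as a filter over a correctness matrix from same-architecture models ordered by scale. The rule below (keep items whose outcome flips exactly once from fail to pass as scale grows) is an illustrative assumption, not STEM's exact criterion:

```python
import numpy as np

def find_sts(correct: np.ndarray) -> np.ndarray:
    # correct: (n_models_by_scale, n_items) matrix of {0, 1} outcomes,
    # rows ordered from smallest to largest model
    diffs = np.diff(correct, axis=0)
    monotone = (diffs >= 0).all(axis=0)        # never regresses with scale
    one_flip = (diffs == 1).sum(axis=0) == 1   # exactly one fail->pass flip
    return np.where(monotone & one_flip)[0]

correct = np.array([[0, 0, 1, 0],    # small model
                    [0, 1, 1, 0],    # medium model
                    [1, 1, 1, 0]])   # large model
print(find_sts(correct))  # items usable to place an unknown model on the scale
```

Counting how many such items an unknown model solves then locates it between the known scale points.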
♻ ☆ Advancing Language Multi-Agent Learning with Credit Re-Assignment for Interactive Environment Generalization
LLM-based agents have made significant advancements in interactive
environments, such as mobile operations and web browsing, and in other domains
beyond computer use. Current multi-agent systems consistently outperform
single agents, but struggle to generalize across environments due to
predefined roles and inadequate strategies for generalizing language agents.
The challenge of achieving both strong performance and good
generalization has hindered the progress of multi-agent systems for interactive
environments. To address these issues, we propose CollabUIAgents, a multi-agent
reinforcement learning framework with a novel multi-agent credit re-assignment
(CR) strategy, assigning process rewards with LLMs rather than
environment-specific rewards and learning with synthesized preference data, in
order to foster generalizable, collaborative behaviors among the role-free
agents' policies. Empirical results show that our framework improves both
performance and cross-environment generalizability of multi-agent systems.
Moreover, our 7B-parameter system achieves results on par with or exceeding
strong closed-source models and the LLM that guides the CR. We also provide
insights into using granular CR rewards effectively for environment
generalization and into accommodating trained LLMs in multi-agent systems.
comment: Published in COLM2025
♻ ☆ Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Recent advances in reinforcement learning (RL) with numerical feedback, such
as scalar rewards, have significantly enhanced the complex reasoning
capabilities of large language models (LLMs). Despite this success, we identify
three key challenges encountered by RL with solely numerical feedback:
performance plateaus, limited effectiveness of spontaneous self-reflection, and
persistent failures. We then demonstrate that RL-finetuned models, even after
exhibiting performance plateaus, can generate correct refinements on
persistently failed problems by leveraging natural language feedback in the
form of critiques. Building on this insight, we propose Critique-GRPO, an
online RL framework that integrates both natural language and numerical
feedback for effective policy optimization. Critique-GRPO enables LLMs to learn
from initial responses and critique-guided self-refinements simultaneously
while maintaining exploration. Additionally, we employ a shaping function to
amplify learning from correct, especially unfamiliar, refinements and penalize
incorrect ones. Extensive experiments with Qwen2.5-7B-Base,
Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently
outperforms supervised learning and RL-based fine-tuning methods across eight
challenging mathematical, STEM, and general reasoning tasks. Specifically,
Critique-GRPO improves average pass@1 scores across all compared methods by
approximately +4.4% on Qwen2.5-7B-Base and +3.8% on Qwen3-8B. Notably,
Critique-GRPO enables effective self-improvement through self-critiquing,
achieving significant gains over GRPO, e.g., +16.7% pass@1 improvement on AIME
2024.
comment: 52 pages, updated with new experimental results and implementation
details
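The shaping idea (amplify correct, especially unfamiliar, refinements; penalize incorrect ones) can be sketched as a small function. The functional form and constants below are assumptions for illustration, not the paper's formula:

```python
def shaped_advantage(advantage: float, is_correct: bool,
                     policy_prob: float, alpha: float = 1.0,
                     beta: float = 0.5) -> float:
    """Toy shaping: boost correct-but-unfamiliar refinements, push down
    incorrect ones. `policy_prob` is the policy's probability of the
    refinement, used here as a familiarity proxy (an assumption)."""
    if is_correct:
        # lower policy probability = less familiar = larger boost
        return advantage * (1.0 + alpha * (1.0 - policy_prob))
    return advantage - beta

print(shaped_advantage(1.0, True, policy_prob=0.05))    # 1.95: amplified
print(shaped_advantage(-1.0, False, policy_prob=0.60))  # -1.5: penalized further
```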
♻ ☆ Enhancing Temporal Sensitivity of Large Language Model for Recommendation with Counterfactual Tuning
Recent advances have applied large language models (LLMs) to sequential
recommendation, leveraging their pre-training knowledge and reasoning
capabilities to provide more personalized user experiences. However, existing
LLM-based methods fail to sufficiently leverage the rich temporal information
inherent in users' historical interaction sequences, stemming from fundamental
architectural constraints: LLMs process information through self-attention
mechanisms that lack inherent sequence ordering and rely on position embeddings
designed primarily for natural language rather than user interaction sequences.
This limitation significantly impairs their ability to capture the evolution of
user preferences over time and predict future interests accurately.
To address this critical gap, we propose \underline{C}ounterfactual
\underline{E}nhanced \underline{T}emporal Framework for LLM-Based
\underline{Rec}ommendation (CETRec). CETRec is grounded in causal inference
principles, which allow it to isolate and measure the specific impact of
temporal information on recommendation outcomes. Combined with our
counterfactual tuning task derived from causal analysis, CETRec effectively
enhances LLMs' awareness of both absolute order (how recently items were
interacted with) and relative order (the sequential relationships between
items). Extensive experiments on real-world datasets demonstrate the
effectiveness of our CETRec. Our code is available at
https://anonymous.4open.science/r/CETRec-B9CE/.
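The causal intuition can be illustrated with a toy counterfactual probe: score the target item against the true interaction order and against a shuffled (counterfactual) order, and attribute the difference to temporal information. Everything below, including the relatedness table and the loss, is an illustrative stand-in for CETRec's actual counterfactual tuning objective:

```python
import random

# toy relatedness table standing in for an LLM's sequence scorer
RELATED = {("socks", "shoes"): 0.9, ("shoes", "socks"): 0.9,
           ("running watch", "shoes"): 0.6}

def toy_loss(history: list[str], target: str) -> float:
    # lower loss when the most recent interaction relates to the target
    return 1.0 - RELATED.get((history[-1], target), 0.1)

def temporal_effect(history, target, loss_fn, trials=100):
    factual = loss_fn(history, target)
    counterfactual = 0.0
    for _ in range(trials):
        shuffled = history[:]
        random.shuffle(shuffled)          # destroy temporal order only
        counterfactual += loss_fn(shuffled, target)
    return counterfactual / trials - factual  # > 0: order carried signal

print(temporal_effect(["running watch", "socks"], "shoes", toy_loss))
```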
♻ ☆ CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description
Recent advances in large language models (LLMs) have significantly improved
the accuracy of Text-to-SQL systems. However, a critical challenge remains: the
semantic mismatch between natural language questions (NLQs) and their
corresponding SQL queries. This issue is exacerbated in large-scale databases,
where semantically similar attributes hinder schema linking and semantic drift
during SQL generation, ultimately reducing model accuracy. To address these
challenges, we introduce CRED-SQL, a framework designed for large-scale
databases that integrates Cluster Retrieval and Execution Description. CRED-SQL
first performs cluster-based large-scale schema retrieval to pinpoint the
tables and columns most relevant to a given NLQ, alleviating schema mismatch.
It then introduces an intermediate natural language representation-Execution
Description Language (EDL)-to bridge the gap between NLQs and SQL. This
reformulation decomposes the task into two stages: Text-to-EDL and EDL-to-SQL,
leveraging LLMs' strong general reasoning capabilities while reducing semantic
deviation. Extensive experiments on two large-scale, cross-domain
benchmarks-SpiderUnion and BirdUnion-demonstrate that CRED-SQL achieves new
state-of-the-art (SOTA) performance, validating its effectiveness and
scalability. Our code is available at https://github.com/smduan/CRED-SQL.git
♻ ☆ ReSpark: Leveraging Previous Data Reports as References to Generate New Reports with LLMs
Yuan Tian, Chuhan Zhang, Xiaotong Wang, Sitong Pan, Weiwei Cui, Haidong Zhang, Dazhen Deng, Yingcai Wu
Creating data reports is a labor-intensive task involving iterative data
exploration, insight extraction, and narrative construction. A key challenge
lies in composing the analysis logic-from defining objectives and transforming
data to identifying and communicating insights. Manually crafting this logic
can be cognitively demanding. While experienced analysts often reuse scripts
from past projects, finding a perfect match for a new dataset is rare. Even
when similar analyses are available online, they usually share only results or
visualizations, not the underlying code, making reuse difficult. To address
this, we present ReSpark, a system that leverages large language models (LLMs)
to reverse-engineer analysis logic from existing reports and adapt it to new
datasets. By generating draft analysis steps, ReSpark provides a warm start for
users. It also supports interactive refinement, allowing users to inspect
intermediate outputs, insert objectives, and revise content. We evaluate
ReSpark through comparative and user studies, demonstrating its effectiveness
in lowering the barrier to generating data reports without relying on existing
analysis code.
♻ ☆ Social Debiasing for Fair Multi-modal LLMs
Multi-modal Large Language Models (MLLMs) have dramatically advanced the
research field and delivered powerful vision-language understanding
capabilities. However, these models often inherit deep-rooted social biases
from their training data, leading to uncomfortable responses with respect to
attributes such as race and gender. This paper addresses the issue of social
biases in MLLMs by i) introducing a comprehensive counterfactual dataset with
multiple social concepts (CMSC), which complements existing datasets by
providing 18 diverse and balanced social concepts; and ii) proposing a
counter-stereotype debiasing (CSD) strategy that mitigates social biases in
MLLMs by leveraging the opposites of prevalent stereotypes. CSD incorporates
both a novel bias-aware data sampling method and a loss rescaling method,
enabling the model to effectively reduce biases. We conduct extensive
experiments with four prevalent MLLM architectures. The results demonstrate the
advantage of the CMSC dataset and the edge of CSD strategy in reducing social
biases compared to existing competing methods, without compromising the overall
performance on general multi-modal reasoning benchmarks.
comment: Project page:
https://github.com/xaCheng1996/Social_Debiasing_For_Fair_MLLMs
♻ ☆ FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
Yutong Liu, Ziyue Zhang, Ban Ma-bao, Yuqing Cai, Yongbin Yu, Renzeng Duojie, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi
Tibetan is a low-resource language with minimal parallel speech corpora
spanning its three major dialects - Ü-Tsang, Amdo, and Kham - limiting progress
in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot,
multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel
dialectal speech from limited reference audio and explicit dialect labels. Our
method features a novel speaker-dialect fusion module and a Dialect-Specialized
Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and
linguistic variations across dialects while preserving speaker identity.
Extensive objective and subjective evaluations demonstrate that FMSD-TTS
significantly outperforms baselines in both dialectal expressiveness and
speaker similarity. We further validate the quality and utility of the
synthesized speech through a challenging speech-to-speech dialect conversion
task. Our contributions include: (1) a novel few-shot TTS system tailored for
Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale
synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source
evaluation toolkit for standardized assessment of speaker similarity, dialect
consistency, and audio quality.
comment: 18 pages
♻ ☆ Each to Their Own: Exploring the Optimal Embedding in RAG
Recently, as Large Language Models (LLMs) have fundamentally impacted various
fields, the methods for incorporating up-to-date information into LLMs or
adding external knowledge to construct domain-specific models have garnered
wide attention. Retrieval-Augmented Generation (RAG), serving as an
inference-time scaling method, is notable for its low cost and minimal effort
for parameter tuning. However, due to heterogeneous training data and model
architecture, the variant embedding models used in RAG exhibit different
benefits across various areas, often leading to different similarity
calculation results and, consequently, varying response quality from LLMs. To
address this problem, we propose and examine two approaches to enhance RAG by
combining the benefits of multiple embedding models, named Mixture-Embedding
RAG and Confident RAG. Mixture-Embedding RAG simply sorts and selects
retrievals from multiple embedding models based on standardized similarity;
however, it does not outperform vanilla RAG. In contrast, Confident RAG
generates responses multiple times using different embedding models and then
selects the responses with the highest confidence level, demonstrating average
improvements of approximately 10% and 5% over vanilla LLMs and RAG,
respectively. The consistent results across different LLMs and embedding models
indicate that Confident RAG is an efficient plug-and-play approach for various
domains. We will release our code upon publication.
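Confident RAG as described is a small selection loop over embedding models. A minimal sketch with hypothetical hooks (`retrieve`, `generate`, `confidence`) onto an existing RAG stack, not an API from the paper:

```python
def confident_rag(query, embedders, retrieve, generate, confidence):
    """Answer `query` once per embedding model; keep the most confident
    response. All four callables are caller-supplied hooks."""
    best_answer, best_conf = None, float("-inf")
    for embedder in embedders:
        docs = retrieve(query, embedder)    # retrieval varies per embedder
        answer = generate(query, docs)      # same LLM every round
        conf = confidence(answer, query)    # e.g., a [0, 1] confidence score
        if conf > best_conf:
            best_answer, best_conf = answer, conf
    return best_answer
```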
♻ ☆ Input Time Scaling
Current Large Language Models (LLMs) are usually post-trained on large-scale,
carefully curated datasets (data & training scaling) and perform reasoning at
test time (inference-time scaling). In this work, we present a new scaling
paradigm, Input Time Scaling, to complement previous scaling methods by putting
resources on queries (input time). During training and testing, we combine
meta-knowledge from LLMs to refine inputs with different strategies. We also
find a new phenomenon, training-testing co-design: query strategies must be
applied during both training and testing, as applying them only during
training or only during testing seriously degrades performance. We are also
surprised to find that seemingly low data quality datasets can gain high
performance. Adding irrelevant information to the queries, or randomly
selecting examples from a minimally filtered dataset, can even perform best. These
findings contradict the widely held inductive bias, "garbage in, garbage out".
Curating datasets with seemingly high-quality data can even potentially limit
the performance ceiling. In addition, models trained on more data of similar
quality (15k vs. 1k) perform worse, so simple dataset-size scaling should also
be inspected carefully. The good news is that our findings are compatible with the
Less is More phenomenon. A small set of examples is enough to evoke high-level
reasoning ability. With experiments on models trained on Qwen2.5-32B-Instruct,
we are able to reach SOTA performance among 32B models on AIME24(76.7%) and
AIME25(76.7%) pass@1. We can further achieve AIME24(76.7%) and AIME25(80%) with
a majority vote of three models. Starting from DeepSeek-R1-Distill-Qwen-32B,
the best result would be 86.7% on AIME24 and 76.7% on AIME25. To facilitate
reproducibility and further research, we are working on open-sourcing our
datasets, data pipelines, evaluation results, and checkpoints.
♻ ☆ Investigating Transcription Normalization in the Faetar ASR Benchmark
We examine the role of transcription inconsistencies in the Faetar Automatic
Speech Recognition benchmark, a challenging low-resource ASR benchmark. With
the help of a small, hand-constructed lexicon, we find that, while
inconsistencies do exist in the transcriptions, they are not the main
challenge in the task. We also demonstrate that bigram word-based language
modelling is of no added benefit, but that constraining decoding to a finite
lexicon can be beneficial. The task remains extremely difficult.
♻ ☆ Deliberate Reasoning in Language Models as Structure-Aware Planning with an Accurate World Model ACL25
Enhancing the reasoning capabilities of language models (LMs) remains a key
challenge, especially for tasks that require complex, multi-step
decision-making where existing Chain-of-Thought (CoT) approaches struggle with
consistency and verification. In this paper, we propose a novel reasoning
framework, referred to as Structure-aware Planning with an Accurate World Model
(SWAP), that integrates structured knowledge representation with learned
planning. Unlike prior methods that rely purely on natural language reasoning,
SWAP leverages entailment graphs to encode structured dependencies and enable
symbolic verification of intermediate steps. To systematically construct and
update the graph, SWAP employs a policy model to propose candidate expansions
and a world model to predict structural updates. To improve accuracy, the world
model generates multiple alternative updates, and a discriminator re-ranks them
based on plausibility. To encourage diverse exploration, we introduce
Diversity-based Modelling (DM), which samples candidates from the remaining
probability mass after removing previously sampled candidates from the original
policy distribution. Additionally, SWAP improves the discrimination accuracy
through Contrastive Ranking (CR), which directly compares candidates within
prompts and incorporates meta-knowledge to improve ranking quality. We evaluate
SWAP across diverse reasoning-intensive benchmarks including math reasoning,
logical reasoning, and coding tasks. Extensive experiments demonstrate that
SWAP significantly improves upon the base models and consistently outperforms
existing reasoning methods.
comment: ACL25 (main)
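Diversity-based Modelling amounts to sampling without replacement from the policy's renormalized leftover probability mass. A minimal sketch over a fixed candidate distribution:

```python
import numpy as np

def diverse_sample(probs: np.ndarray, k: int, seed: int = 0) -> list[int]:
    rng = np.random.default_rng(seed)
    remaining = probs.astype(float).copy()
    picks = []
    for _ in range(min(k, int((remaining > 0).sum()))):
        p = remaining / remaining.sum()   # renormalized leftover mass
        i = rng.choice(len(p), p=p)
        picks.append(int(i))
        remaining[i] = 0.0                # remove the sampled candidate
    return picks

print(diverse_sample(np.array([0.6, 0.25, 0.1, 0.05]), k=3))
```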
♻ ☆ Chain of Correction for Full-text Speech Recognition with Large Language Models
Full-text error correction with Large Language Models (LLMs) for Automatic
Speech Recognition (ASR) is attracting increased attention for its ability to
address a wide range of error types, such as punctuation restoration and
inverse text normalization, across long context. However, challenges remain
regarding stability, controllability, completeness, and fluency. To mitigate
these issues, this paper proposes the Chain of Correction (CoC), which uses a
multi-turn chat format to correct errors segment by segment, guided by
pre-recognized text and full-text context for better semantic understanding.
Utilizing the open-sourced ChFT dataset, we fine-tune a pre-trained LLM to
evaluate CoC's performance. Experiments show that CoC significantly outperforms
baseline and benchmark systems in correcting full-text ASR outputs. We also
analyze correction thresholds to balance under-correction and over-rephrasing,
extrapolate CoC on extra-long ASR outputs, and explore using other types of
information to guide error correction.
♻ ☆ Enhancing Depression-Diagnosis-Oriented Chat with Psychological State Tracking NLPCC 2025
Depression-diagnosis-oriented chat aims to guide patients in self-expression
to collect key symptoms for depression detection. Recent work focuses on
combining task-oriented dialogue and chitchat to simulate the interview-based
depression diagnosis. However, these methods cannot adequately capture the
patient's changing information, feelings, or symptoms during dialogues.
Moreover, no explicit framework has been explored to guide the dialogue, which
results in unproductive exchanges that degrade the experience. In this paper, we
propose to integrate Psychological State Tracking (POST) within the large
language model (LLM) to explicitly guide depression-diagnosis-oriented chat.
Specifically, the state is adapted from a psychological theoretical model,
which consists of four components, namely Stage, Information, Summary and Next.
We fine-tune an LLM to generate the dynamic psychological state, which is
further used to assist response generation at each turn to simulate the
psychiatrist. Experimental results on the existing benchmark show that our
proposed method boosts the performance of all subtasks in
depression-diagnosis-oriented chat.
comment: Accepted by NLPCC 2025
♻ ☆ A Little Human Data Goes A Long Way ACL 2025
Faced with an expensive human annotation process, creators of NLP systems
increasingly turn to synthetic data generation. While this method shows
promise, the extent to which synthetic data can replace human annotation is
poorly understood. We investigate the use of synthetic data in Fact
Verification (FV) and Question Answering (QA) by studying the effects of
incrementally replacing human generated data with synthetic points on eight
diverse datasets. Strikingly, replacing up to 90% of the training data only
marginally decreases performance, but replacing the final 10% leads to severe
declines. We find that models trained on purely synthetic data can be reliably
improved by including as few as 125 human generated data points. We show that
matching the performance gain of just a little additional human data (only 200
points) requires an order of magnitude more synthetic data and estimate price
ratios at which human annotation would be a more cost-effective solution. Our
results suggest that even when human annotation at scale is infeasible, there
is great value to having a small proportion of the dataset being human
generated.
comment: ACL 2025
♻ ☆ CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search
Approximate nearest-neighbor search (ANNS) algorithms have become
increasingly critical for recent AI applications, particularly in
retrieval-augmented generation (RAG) and agent-based LLM applications. In this
paper, we present CRINN, a new paradigm for ANNS algorithms. CRINN treats ANNS
optimization as a reinforcement learning problem where execution speed serves
as the reward signal. This approach enables the automatic generation of
progressively faster ANNS implementations while maintaining accuracy
constraints. Our experimental evaluation demonstrates CRINN's effectiveness
across six widely-used NNS benchmark datasets. When compared against
state-of-the-art open-source ANNS algorithms, CRINN achieves best performance
on three of them (GIST-960-Euclidean, MNIST-784-Euclidean, and
GloVe-25-angular), and tied for first place on two of them (SIFT-128-Euclidean
and GloVe-25-angular). The implications of CRINN's success reach well beyond
ANNS optimization: It validates that LLMs augmented with reinforcement learning
can function as an effective tool for automating sophisticated algorithmic
optimizations that demand specialized knowledge and labor-intensive manual
refinement. Code can be found at https://github.com/deepreinforce-ai/CRINN
comment: Preprint Version
♻ ☆ Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as
a key paradigm for post-training Large Language Models (LLMs), particularly for
complex reasoning tasks. However, vanilla RLVR training has been shown to
improve Pass@1 performance at the expense of policy entropy, leading to reduced
generation diversity and limiting the Pass@k performance, which typically
represents the upper bound of LLM reasoning capability. In this paper, we
systematically analyze the policy's generation diversity from the perspective
of training problems and find that augmenting and updating training problems
helps mitigate entropy collapse during training. Based on these observations,
we propose an online Self-play with Variational problem Synthesis (SvS)
strategy for RLVR training, which uses the policy's correct solutions to
synthesize variational problems while ensuring their reference answers remain
identical to the originals. This self-improving strategy effectively maintains
policy entropy during training and substantially improves Pass@k compared with
standard RLVR, sustaining prolonged improvements and achieving absolute gains
of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and
AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model
sizes from 3B to 32B consistently demonstrate the generalizability and
robustness of SvS.
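The synthesis step hinges on one invariant: a variant is kept only if its reference answer is unchanged, so the original verifiable reward still applies. A minimal sketch with hypothetical hooks for the policy's rewrite step and the answer check:

```python
def synthesize_variants(problem: str, answer: str, solution: str,
                        rewrite_problem, answer_preserved, max_tries=4):
    """Collect variational problems whose reference answer is unchanged.
    `rewrite_problem` samples a rewrite from the policy conditioned on
    one of its own correct solutions; `answer_preserved` verifies the
    variant still yields the original answer. Both are assumed hooks."""
    variants = []
    for _ in range(max_tries):
        variant = rewrite_problem(problem, solution)   # self-play step
        if variant != problem and answer_preserved(variant, answer):
            variants.append({"problem": variant, "answer": answer})
    return variants
```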
♻ ☆ ChuLo: Chunk-Level Key Information Representation for Long Document Understanding ACL 2025
Transformer-based models have achieved remarkable success in various Natural
Language Processing (NLP) tasks, yet their ability to handle long documents is
constrained by computational limitations. Traditional approaches, such as
truncating inputs, sparse self-attention, and chunking, attempt to mitigate
these issues, but they often lead to information loss and hinder the model's
ability to capture long-range dependencies. In this paper, we introduce ChuLo,
a novel chunk representation method for long document understanding that
addresses these limitations. Our ChuLo groups input tokens using unsupervised
keyphrase extraction, emphasizing semantically important keyphrase based chunks
to retain core document content while reducing input length. This approach
minimizes information loss and improves the efficiency of Transformer-based
models. Preserving all tokens in long document understanding, especially token
classification tasks, is important to ensure that fine-grained annotations,
which depend on the entire sequence context, are not lost. We evaluate our
method on multiple long document classification tasks and long document token
classification tasks, demonstrating its effectiveness through comprehensive
qualitative and quantitative analysis. Our implementation is open-sourced on
https://github.com/adlnlp/Chulo.
comment: The paper has been accepted to ACL 2025
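The chunk-level weighting idea can be illustrated with a simple frequency heuristic standing in for unsupervised keyphrase extraction (ChuLo's actual extractor is more sophisticated): score each fixed-size chunk by how keyphrase-rich it is, then emphasize high-scoring chunks.

```python
from collections import Counter

STOP = {"the", "a", "of", "and", "to", "in", "is", "are", "for", "on"}

def chunk_and_weight(tokens: list[str], chunk_size: int = 8, top_k: int = 5):
    # frequency heuristic as a stand-in for keyphrase extraction
    counts = Counter(t.lower() for t in tokens if t.lower() not in STOP)
    keywords = {w for w, _ in counts.most_common(top_k)}
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens), chunk_size)]
    weights = [sum(t.lower() in keywords for t in c) / len(c) for c in chunks]
    return chunks, weights   # higher weight = keyphrase-rich chunk

text = ("long document understanding needs chunk level key information "
        "because transformers truncate long documents and lose key chunks").split()
print(chunk_and_weight(text))
```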