Computation and Language
☆ DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
The capacity for complex mathematical reasoning is a key benchmark for
artificial intelligence. While reinforcement learning (RL) applied to LLMs
shows promise, progress is significantly hindered by the lack of large-scale
training data that is sufficiently challenging, possesses verifiable answer
formats suitable for RL, and is free from contamination with evaluation
benchmarks. To address these limitations, we introduce DeepMath-103K, a new,
large-scale dataset comprising approximately 103K mathematical problems,
specifically designed to train advanced reasoning models via RL. DeepMath-103K
is curated through a rigorous pipeline involving source analysis, stringent
decontamination against numerous benchmarks, and filtering for high difficulty
(primarily Levels 5-9), significantly exceeding existing open resources in
challenge. Each problem includes a verifiable final answer, enabling rule-based
RL, and three distinct R1-generated solutions suitable for diverse training
paradigms like supervised fine-tuning or distillation. Spanning a wide range of
mathematical topics, DeepMath-103K promotes the development of generalizable
reasoning. We demonstrate that models trained on DeepMath-103K achieve
significant improvements on challenging mathematical benchmarks, validating its
effectiveness. We release DeepMath-103K publicly to facilitate community
progress in building more capable AI reasoning systems:
https://github.com/zwhe99/DeepMath.
comment: WIP
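For context on what rule-based RL with a verifiable final answer can look like, here is a minimal sketch of an answer-matching reward; the \boxed{} extraction and the string normalization are illustrative assumptions rather than the dataset's official verifier.

```python
# Sketch of a rule-based reward enabled by verifiable final answers: extract
# the model's boxed answer and compare it with the reference string.
import re

def extract_boxed(text: str):
    m = re.search(r"\\boxed\{([^{}]*)\}", text)
    return m.group(1).strip() if m else None

def rule_based_reward(completion: str, reference: str) -> float:
    # Binary reward: 1.0 if the extracted final answer matches the reference.
    pred = extract_boxed(completion)
    if pred is None:
        return 0.0
    return 1.0 if pred.replace(" ", "") == reference.replace(" ", "") else 0.0

print(rule_based_reward(r"... so the answer is \boxed{3/4}.", "3/4"))   # 1.0
```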
☆ TextArena
TextArena is an open-source collection of competitive text-based games for
training and evaluation of agentic behavior in Large Language Models (LLMs). It
spans 57+ unique environments (including single-player, two-player, and
multi-player setups) and allows for easy evaluation of model capabilities via
an online-play system (against humans and other submitted models) with
real-time TrueSkill scores. Traditional benchmarks rarely assess dynamic social
skills such as negotiation, theory of mind, and deception, creating a gap that
TextArena addresses. Designed with research, community and extensibility in
mind, TextArena emphasizes ease of adding new games, adapting the framework,
testing models, playing against the models, and training models. Detailed
documentation of environments, games, the leaderboard, and examples is available
on https://github.com/LeonGuertler/TextArena and https://www.textarena.ai/.
comment: work in progress; 5 pages, 3 figures
☆ TADACap: Time-series Adaptive Domain-Aware Captioning
Elizabeth Fons, Rachneet Kaur, Zhen Zeng, Soham Palande, Tucker Balch, Svitlana Vyetrenko, Manuela Veloso
While image captioning has gained significant attention, the potential of
captioning time-series images, prevalent in areas like finance and healthcare,
remains largely untapped. Existing time-series captioning methods typically
offer generic, domain-agnostic descriptions of time-series shapes and struggle
to adapt to new domains without substantial retraining. To address these
limitations, we introduce TADACap, a retrieval-based framework to generate
domain-aware captions for time-series images, capable of adapting to new
domains without retraining. Building on TADACap, we propose TADACap-diverse, a
novel retrieval strategy that retrieves diverse image-caption pairs from a
target-domain database. We benchmarked TADACap-diverse against
state-of-the-art methods and ablation variants. TADACap-diverse demonstrates
comparable semantic accuracy while requiring significantly less annotation
effort.
comment: Accepted to ICAIF 2024
☆ Masculine Defaults via Gendered Discourse in Podcasts and Large Language Models
Masculine defaults are widely recognized as a significant type of gender
bias, but they often go unnoticed because they are under-researched. Masculine
defaults involve three key parts: (i) the cultural context, (ii) the masculine
characteristics or behaviors, and (iii) the reward for, or simply acceptance
of, those masculine characteristics or behaviors. In this work, we study
discourse-based masculine defaults, and propose a twofold framework for (i) the
large-scale discovery and analysis of gendered discourse words in spoken
content via our Gendered Discourse Correlation Framework (GDCF); and (ii) the
measurement of the gender bias associated with these gendered discourse words
in LLMs via our Discourse Word-Embedding Association Test (D-WEAT). We focus
our study on podcasts, a popular and growing form of social media, analyzing
15,117 podcast episodes. We analyze correlations between gender and discourse
words -- discovered via LDA and BERTopic -- to automatically form gendered
discourse word lists. We then study the prevalence of these gendered discourse
words in domain-specific contexts, and find that gendered discourse-based
masculine defaults exist in the domains of business, technology/politics, and
video games. Next, we study the representation of these gendered discourse
words in a state-of-the-art LLM embedding model from OpenAI, and find that
the masculine discourse words have a more stable and robust representation than
the feminine discourse words, which may result in better system performance on
downstream tasks for men. Hence, men are rewarded for their discourse patterns
with better system performance by one of the state-of-the-art language models
-- and this embedding disparity is a representational harm and a masculine
default.
comment: To appear in ICWSM 2025
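To make the D-WEAT idea concrete, the following is a generic WEAT-style effect-size computation; the random vectors standing in for embeddings and the word-set roles are illustrative, and the paper's exact discourse-word lists and embedding model differ.

```python
# Illustrative WEAT-style association test in the spirit of D-WEAT.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    # s(w, A, B): mean similarity to attribute set A minus mean similarity to B.
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # X, Y: embeddings of two target word sets (e.g. masculine vs. feminine
    # discourse words); A, B: embeddings of two attribute word sets.
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
X, Y, A, B = (rng.normal(size=(5, 64)) for _ in range(4))
print(weat_effect_size(X, Y, A, B))
```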
☆ A Dual-Space Framework for General Knowledge Distillation of Large Language Models
Knowledge distillation (KD) is a promising solution to compress large
language models (LLMs) by transferring their knowledge to smaller models.
During this process, white-box KD methods usually minimize the distance between
the output distributions of the teacher model and the student model to transfer
more information. However, we reveal that the current white-box KD framework
exhibits two limitations: a) bridging probability distributions from different
output spaces will limit the similarity between the teacher model and the
student model; b) this framework cannot be applied to LLMs with different
vocabularies. One of the root causes for these limitations is that the
distributions from the teacher and the student for KD are output by different
prediction heads, which yield distributions in different output spaces and
dimensions. Therefore, in this paper, we propose a dual-space knowledge
distillation (DSKD) framework that unifies the prediction heads of the teacher
and the student models for KD. Specifically, we first introduce two projectors
with ideal initialization to project the teacher/student hidden states into the
student/teacher representation spaces. After this, the hidden states from
different models can share the same head and unify the output spaces of the
distributions. Furthermore, we develop an exact token alignment (ETA) algorithm
to align the same tokens in two differently-tokenized sequences. Based on the
above, our DSKD framework is a general KD framework that supports both
off-policy and on-policy KD, and KD between any two LLMs regardless of their
vocabularies. Extensive experiments on instruction-following, mathematical
reasoning, and code generation benchmarks show that DSKD significantly
outperforms existing methods based on the current white-box KD framework and
surpasses other cross-tokenizer KD methods for LLMs with different
vocabularies.
comment: 19 pages, 9 figures, 11 tables, under review. Code is available at:
https://github.com/songmzhang/DSKDv2. arXiv admin note: text overlap with
arXiv:2406.17328
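A minimal sketch of the dual-space idea, assuming one possible projection direction (teacher hidden states into the student space) so that both distributions come from the same prediction head; the dimensions, initialization, and KD loss below are illustrative choices, not the released DSKDv2 code.

```python
# Project the teacher's hidden states into the student's representation space
# so both distributions are produced by the *same* (shared) prediction head.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_teacher, d_student, vocab = 1024, 512, 32000
student_head = nn.Linear(d_student, vocab, bias=False)   # shared output space
t2s_proj = nn.Linear(d_teacher, d_student, bias=False)   # teacher -> student projector

def dual_space_kd_loss(teacher_hidden, student_hidden, temperature=1.0):
    # Teacher hidden states are detached; the projector is trained jointly.
    teacher_logits = student_head(t2s_proj(teacher_hidden.detach())) / temperature
    student_logits = student_head(student_hidden) / temperature
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")

# Toy usage on random hidden states (batch of 2, sequence length 8).
loss = dual_space_kd_loss(torch.randn(2, 8, d_teacher), torch.randn(2, 8, d_student))
print(float(loss))
```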
☆ Reinforcing Compositional Retrieval: Retrieving Step-by-Step for Composing Informative Contexts
Large Language Models (LLMs) have demonstrated remarkable capabilities across
numerous tasks, yet they often rely on external context to handle complex
tasks. While retrieval-augmented frameworks traditionally focus on selecting
top-ranked documents in a single pass, many real-world scenarios demand
compositional retrieval, where multiple sources must be combined in a
coordinated manner. In this work, we propose a tri-encoder sequential retriever
that models this process as a Markov Decision Process (MDP), decomposing the
probability of retrieving a set of elements into a sequence of conditional
probabilities and allowing each retrieval step to be conditioned on previously
selected examples. We train the retriever in two stages: first, we efficiently
construct supervised sequential data for initial policy training; we then
refine the policy to align with the LLM's preferences using a reward grounded
in the structural correspondence of generated programs. Experimental results
show that our method consistently and significantly outperforms baselines,
underscoring the importance of explicitly modeling inter-example dependencies.
These findings highlight the potential of compositional retrieval for tasks
requiring multiple pieces of evidence or examples.
comment: 19 pages, 8 figures
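To illustrate the step-by-step factorization, here is a toy sequential retriever in which each selection is conditioned on what has already been retrieved; the hash-based encoder and greedy scoring are placeholders for the paper's trained tri-encoder and RL-refined policy.

```python
# Greedy sequential retrieval: factor p(e_1, ..., e_k | q) into conditionals,
# re-scoring candidates after each selection.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Hypothetical placeholder encoder (deterministic hash-seeded vector).
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def sequential_retrieve(query: str, corpus: list, k: int = 3) -> list:
    selected, remaining = [], list(corpus)
    for _ in range(k):
        # Condition the state on the query plus everything chosen so far.
        state = embed(" ".join([query] + selected))
        scores = [state @ embed(doc) for doc in remaining]
        selected.append(remaining.pop(int(np.argmax(scores))))
    return selected

print(sequential_retrieve("sort a list in python", ["doc A", "doc B", "doc C", "doc D"]))
```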
☆ Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning
Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Marcin Chochowski, Yashaswi Karnati, Raviraj Joshi, Ameya Sunil Mahabaleshwarkar, Zijia Chen, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
Hybrid LLM architectures that combine Attention and State Space Models (SSMs)
achieve state-of-the-art accuracy and runtime performance. Recent work has
demonstrated that applying compression and distillation to Attention-only
models yields smaller, more accurate models at a fraction of the training cost.
In this work, we explore the effectiveness of compressing Hybrid architectures.
We introduce a novel group-aware pruning strategy that preserves the structural
integrity of SSM blocks and their sequence modeling capabilities. Furthermore,
we demonstrate the necessity of such SSM pruning to achieve improved accuracy
and inference speed compared to traditional approaches. Our compression recipe
combines SSM, FFN, embedding dimension, and layer pruning, followed by
knowledge distillation-based retraining, similar to the MINITRON technique.
Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B
parameters with up to 40x fewer training tokens. The resulting model surpasses
the accuracy of similarly-sized models while achieving 2x faster inference,
significantly advancing the Pareto frontier.
☆ DataDecide: How to Predict Best Pretraining Data with Small Experiments
Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
Because large language models are expensive to pretrain on different
datasets, using smaller-scale experiments to decide on data is crucial for
reducing costs. Which benchmarks and methods of making decisions from observed
performance at small scale most accurately predict the datasets that yield the
best large models? To empower open exploration of this question, we release
models, data, and evaluations in DataDecide -- the most extensive open suite of
models over differences in data and scale. We conduct controlled pretraining
experiments across 25 corpora with differing sources, deduplication, and
filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random
seeds. We find that the ranking of models at a single, small size (e.g., 150M
parameters) is a strong baseline for predicting best models at our larger
target scale (1B) (~80% of comparisons correct). No scaling law methods among
8 baselines exceed the compute-decision frontier of single-scale predictions,
but DataDecide can measure improvement in future scaling laws. We also identify
that using continuous likelihood metrics as proxies in small experiments makes
benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable
at the target 1B scale with just 0.01% of the compute.
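A small sketch of the pairwise decision-accuracy measure described above, assuming made-up benchmark scores: it checks how often the ranking at the small proxy scale agrees with the ranking at the 1B target scale.

```python
# For every pair of pretraining corpora, does the small-scale ranking
# predict the large-scale ranking?
from itertools import combinations

def decision_accuracy(small_scores: dict, large_scores: dict) -> float:
    pairs = list(combinations(small_scores, 2))
    correct = sum(
        (small_scores[a] > small_scores[b]) == (large_scores[a] > large_scores[b])
        for a, b in pairs
    )
    return correct / len(pairs)

small = {"corpus_A": 0.41, "corpus_B": 0.38, "corpus_C": 0.44}   # 150M proxy runs
large = {"corpus_A": 0.55, "corpus_B": 0.57, "corpus_C": 0.61}   # 1B target runs
print(decision_accuracy(small, large))   # fraction of pairwise decisions that transfer
```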
☆ RankAlign: A Ranking View of the Generator-Validator Gap in Large Language Models
Although large language models (LLMs) have become generally more capable and
accurate across many tasks, some fundamental sources of unreliability remain in
their behavior. One key limitation is their inconsistency in reporting the
same information when prompts are changed. In this paper, we consider the
discrepancy between a model's generated answer and its own verification of
that answer, the generator-validator gap. We define this gap in a more
stringent way than prior work: we expect correlation of scores from a generator
and a validator over the entire set of candidate answers. We show that
according to this measure, a large gap exists in various settings, including
question answering, lexical semantics tasks, and next-word prediction. We then
propose RankAlign, a ranking-based training method, and show that it
significantly closes the gap by 31.8% on average, surpassing all baseline
methods. Moreover, this approach generalizes well to out-of-domain tasks and
lexical items.
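As a concrete reading of the gap measure, the sketch below computes the rank correlation between a generator's answer scores and a validator's scores over the same candidate set; the scores are invented, and the paper's scoring details may differ.

```python
# Generator-validator gap as rank correlation over the full candidate set.
import numpy as np
from scipy.stats import spearmanr

candidates = ["Paris", "Lyon", "Marseille", "Toulouse"]
generator_logprobs = np.array([-0.2, -2.1, -2.4, -3.0])   # log p(answer | question)
validator_scores   = np.array([0.95, 0.40, 0.70, 0.20])   # p("yes" | question, answer)

rho, _ = spearmanr(generator_logprobs, validator_scores)
print(f"generator-validator rank correlation: {rho:.2f}")  # 1.0 would mean no gap
```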
☆ Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions
Wang Bill Zhu, Tianqi Chen, Ching Ying Lin, Jade Law, Mazen Jizzini, Jorge J. Nieva, Ruishan Liu, Robin Jia
Cancer patients are increasingly turning to large language models (LLMs) as a
new form of internet search for medical information, making it critical to
assess how well these models handle complex, personalized questions. However,
current medical benchmarks focus on medical exams or consumer-searched
questions and do not evaluate LLMs on real patient questions with detailed
clinical contexts. In this paper, we first evaluate LLMs on cancer-related
questions drawn from real patients, reviewed by three hematology oncology
physicians. While responses are generally accurate, with GPT-4-Turbo scoring
4.13 out of 5, the models frequently fail to recognize or address false
presuppositions in the questions, posing risks to safe medical decision-making.
To study this limitation systematically, we introduce Cancer-Myth, an
expert-verified adversarial dataset of 585 cancer-related questions with false
presuppositions. On this benchmark, no frontier LLM -- including GPT-4o,
Gemini-1.5-Pro, and Claude-3.5-Sonnet -- corrects these false presuppositions
more than 30% of the time. Even advanced medical agentic methods do not prevent
LLMs from ignoring false presuppositions. These findings expose a critical gap
in the clinical reliability of LLMs and underscore the need for more robust
safeguards in medical AI systems.
☆ OpenTuringBench: An Open-Model-based Benchmark and Framework for Machine-Generated Text Detection and Attribution
Open Large Language Models (OLLMs) are increasingly leveraged in generative
AI applications, posing new challenges for detecting their outputs. We propose
OpenTuringBench, a new benchmark based on OLLMs, designed to train and evaluate
machine-generated text detectors on the Turing Test and Authorship Attribution
problems. OpenTuringBench focuses on a representative set of OLLMs, and
features a number of challenging evaluation tasks, including
human/machine-manipulated texts, out-of-domain texts, and texts from previously
unseen models. We also provide OTBDetector, a contrastive learning framework to
detect and attribute OLLM-based machine-generated texts. Results highlight the
relevance and varying degrees of difficulty of the OpenTuringBench tasks, with
our detector achieving remarkable capabilities across the various tasks and
outperforming most existing detectors. Resources are available on the
OpenTuringBench Hugging Face repository at
https://huggingface.co/datasets/MLNTeam-Unical/OpenTuringBench
comment: Under review with ARR
☆ Network Alignment
Rui Tang, Ziyun Yong, Shuyu Jiang, Xingshu Chen, Yaofang Liu, Yi-Cheng Zhang, Gui-Quan Sun, Wei Wang
Complex networks are frequently employed to model physical or virtual complex
systems. When certain entities exist across multiple systems simultaneously,
unveiling their corresponding relationships across the networks becomes
crucial. This problem, known as network alignment, holds significant
importance. It enhances our understanding of complex system structures and
behaviours, facilitates the validation and extension of theoretical physics
research on complex systems, and fosters diverse practical
applications across various fields. However, due to variations in the
structure, characteristics, and properties of complex networks across different
fields, the study of network alignment is often isolated within each domain,
with even the terminologies and concepts lacking uniformity. This review
comprehensively summarizes the latest advancements in network alignment
research, focusing on analyzing network alignment characteristics and progress
in various domains such as social network analysis, bioinformatics,
computational linguistics and privacy protection. It provides a detailed
analysis of various methods' implementation principles, processes, and
performance differences, including structure consistency-based methods, network
embedding-based methods, and graph neural network-based (GNN-based) methods.
Additionally, the methods for network alignment under different conditions,
such as in attributed networks, heterogeneous networks, directed networks, and
dynamic networks, are presented. Furthermore, the challenges and the open
issues for future studies are also discussed.
☆ Teaching Large Language Models to Reason through Learning and Forgetting
Leveraging inference-time search in large language models has proven
effective in further enhancing a trained model's capability to solve complex
mathematical and reasoning problems. However, this approach significantly
increases computational costs and inference time, as the model must generate
and evaluate multiple candidate solutions to identify a viable reasoning path.
To address this, we propose an effective approach that integrates search
capabilities directly into the model by fine-tuning it using both successful
(learning) and failed reasoning paths (forgetting) derived from diverse search
methods. While fine-tuning the model with these data might seem
straightforward, we identify a critical issue: the model's search capability
tends to degrade rapidly if fine-tuning is performed naively. We show that this
degradation can be substantially mitigated by employing a smaller learning
rate. Extensive experiments on the challenging Game-of-24 and Countdown
mathematical reasoning benchmarks show that our approach not only outperforms
both standard fine-tuning and inference-time search baselines but also
significantly reduces inference time by 180$\times$.
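One plausible way to realize a learn-and-forget objective of this kind is a likelihood term on successful paths plus an unlikelihood term on failed ones, trained with a deliberately small learning rate as the abstract recommends; the sketch below is a hedged illustration, not the paper's exact objective.

```python
# Learn from successful reasoning paths (NLL) and "forget" failed ones
# (unlikelihood). The combination and weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def learn_forget_loss(logits_success, targets_success,
                      logits_fail, targets_fail, forget_weight=0.1):
    # logits_*: (batch, seq, vocab); targets_*: (batch, seq) token ids.
    learn = F.cross_entropy(logits_success.transpose(1, 2), targets_success)
    p_fail = torch.gather(F.softmax(logits_fail, dim=-1), -1,
                          targets_fail.unsqueeze(-1)).squeeze(-1)
    forget = -torch.log(1.0 - p_fail + 1e-8).mean()   # unlikelihood on failed paths
    return learn + forget_weight * forget

# Toy usage on random logits; in practice a small LR (per the abstract) is key.
B, S, V = 2, 4, 10
loss = learn_forget_loss(torch.randn(B, S, V), torch.randint(0, V, (B, S)),
                         torch.randn(B, S, V), torch.randint(0, V, (B, S)))
print(float(loss))
```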
☆ A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, Hanze Dong
Reinforcement learning (RL) has become a prevailing approach for fine-tuning
large language models (LLMs) on complex reasoning tasks. Among recent methods,
GRPO stands out for its empirical success in training models such as
DeepSeek-R1, yet the sources of its effectiveness remain poorly understood. In
this work, we revisit GRPO from a reinforce-like algorithm perspective and
analyze its core components. Surprisingly, we find that a simple rejection
sampling baseline, RAFT, which trains only on positively rewarded samples,
yields performance competitive with GRPO and PPO. Our ablation studies reveal
that GRPO's main advantage arises from discarding prompts with entirely
incorrect responses, rather than from its reward normalization. Motivated by
this insight, we propose Reinforce-Rej, a minimal extension of policy gradient
that filters both entirely incorrect and entirely correct samples.
Reinforce-Rej improves KL efficiency and stability, serving as a lightweight
yet effective alternative to more complex RL algorithms. We advocate RAFT as a
robust and interpretable baseline, and suggest that future advances should
focus on more principled designs for incorporating negative samples, rather
than relying on them indiscriminately. Our findings provide guidance for future
work in reward-based LLM post-training.
comment: 12 pages, 4 figures
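The two filtering rules can be stated in a few lines; the sketch below assumes binary correctness rewards per sampled response and illustrates the described behavior rather than the authors' implementation.

```python
# RAFT keeps only positively rewarded samples; Reinforce-Rej additionally drops
# prompts whose sampled responses are all incorrect or all correct.
def raft_filter(samples):
    # samples: (prompt, response, reward) triples; keep correct responses only.
    return [(p, r) for p, r, rew in samples if rew == 1]

def reinforce_rej_filter(grouped):
    # grouped: prompt -> list of binary rewards for its sampled responses.
    kept = {}
    for prompt, rewards in grouped.items():
        if 0 < sum(rewards) < len(rewards):   # mixed outcomes carry signal
            kept[prompt] = rewards
    return kept

samples = [("q1", "resp_a", 1), ("q1", "resp_b", 0), ("q2", "resp_c", 0)]
print(raft_filter(samples))                     # [("q1", "resp_a")]
grouped = {"q1": [1, 0, 0, 1], "q2": [0, 0, 0, 0], "q3": [1, 1, 1, 1]}
print(reinforce_rej_filter(grouped))            # only "q1" survives
```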
☆ REWARD CONSISTENCY: Improving Multi-Objective Alignment from a Data-Centric Perspective
Multi-objective preference alignment in language models often encounters a
challenging trade-off: optimizing for one human preference (e.g., helpfulness)
frequently compromises others (e.g., harmlessness) due to the inherent
conflicts between competing objectives. While prior work mainly focuses on
algorithmic solutions, we explore a novel data-driven approach to uncover the
types of data that can effectively mitigate these conflicts. Specifically, we
propose the concept of Reward Consistency (RC), which identifies samples that
align with multiple preference objectives, thereby reducing conflicts during
training. Through gradient-based analysis, we demonstrate that RC-compliant
samples inherently constrain performance degradation during multi-objective
optimization. Building on these insights, we further develop Reward Consistency
Sampling, a framework that automatically constructs preference datasets that
effectively mitigate conflicts during multi-objective alignment. Our generated
data achieves an average improvement of 13.37% in both the harmless rate and
helpfulness win rate when optimizing harmlessness and helpfulness, and can
consistently resolve conflicts in varying multi-objective scenarios.
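A minimal sketch of the reward-consistency check, assuming one scalar reward model per objective: a preference pair is kept only if every objective prefers the same response, so training on it cannot push one objective up at another's expense.

```python
# Keep a preference pair only when all objectives agree on the ordering.
def is_reward_consistent(chosen_rewards: dict, rejected_rewards: dict) -> bool:
    return all(chosen_rewards[k] > rejected_rewards[k] for k in chosen_rewards)

sample = {
    "chosen":   {"helpfulness": 0.81, "harmlessness": 0.93},
    "rejected": {"helpfulness": 0.40, "harmlessness": 0.55},
}
print(is_reward_consistent(sample["chosen"], sample["rejected"]))   # True -> keep
```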
☆ Looking beyond the next token
The structure of causal language model training assumes that each token can
be accurately predicted from the previous context. This contrasts with humans'
natural writing and reasoning process, where goals are typically known before
the exact argument or phrasing. While this mismatch has been well studied in
the literature, the working assumption has been that architectural changes are
needed to address this mismatch. We argue that rearranging and processing the
training data sequences can allow models to more accurately imitate the true
data-generating process, and does not require any other changes to the
architecture or training infrastructure. We demonstrate that this technique,
Trelawney, and the inference algorithms derived from it allow us to improve
performance on several key benchmarks that span planning, algorithmic
reasoning, and story generation tasks. Finally, our method naturally enables
the generation of long-term goals at no additional cost. We investigate how
using the model's goal-generation capability can further improve planning and
reasoning. Additionally, we believe Trelawney could potentially open doors to
new capabilities beyond the current language modeling paradigm.
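The abstract does not spell out the rearrangement, so the following is a loosely hedged illustration of one way to splice a future "goal" snippet into a training sequence behind special markers; the marker names and placement policy are assumptions, not the paper's specification.

```python
# Illustrative data rearrangement: expose a snippet of the future (a "goal")
# earlier in the sequence, delimited by hypothetical marker tokens.
import random

def insert_goal(tokens, lookahead=8, goal_len=4, seed=0):
    rng = random.Random(seed)
    pos = rng.randrange(1, max(2, len(tokens) - lookahead))
    goal = tokens[pos + lookahead : pos + lookahead + goal_len]
    return tokens[:pos] + ["<goal>"] + goal + ["</goal>"] + tokens[pos:]

story = "the knight set out at dawn and after many trials finally reached the dragon".split()
print(" ".join(insert_goal(story)))
```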
☆ Dependency Structure Augmented Contextual Scoping Framework for Multimodal Aspect-Based Sentiment Analysis ACM MM2025
Multimodal Aspect-Based Sentiment Analysis (MABSA) seeks to extract
fine-grained information from image-text pairs to identify aspect terms and
determine their sentiment polarity. However, existing approaches often fall
short in simultaneously addressing three core challenges: Sentiment Cue
Perception (SCP), Multimodal Information Misalignment (MIM), and Semantic Noise
Elimination (SNE). To overcome these limitations, we propose DASCO
(\textbf{D}ependency Structure \textbf{A}ugmented \textbf{Sco}ping Framework),
a fine-grained scope-oriented framework that enhances aspect-level sentiment
reasoning by leveraging dependency parsing trees. First, we designed a
multi-task pretraining strategy for MABSA on our base model, combining
aspect-oriented enhancement, image-text matching, and aspect-level
sentiment-sensitive cognition. This improved the model's perception of aspect
terms and sentiment cues while achieving effective image-text alignment,
addressing key challenges like SCP and MIM. Furthermore, we incorporate
dependency trees as a syntactic branch combined with a semantic branch, guiding
the model to selectively attend to critical contextual elements within a
target-specific scope while effectively filtering out irrelevant noise, thereby
addressing the SNE problem. Extensive experiments on two benchmark datasets across
three subtasks demonstrate that DASCO achieves state-of-the-art performance in
MABSA, with notable gains in JMASA (+3.1\% F1 and +5.4\% precision on
Twitter2015).
comment: submitted to ACM MM2025
☆ Automated Python Translation
Python is one of the most commonly used programming languages in industry and
education. Its English keywords and built-in functions/modules allow it to come
close to pseudo-code in terms of its readability and ease of writing. However,
those who do not speak English may not experience these advantages. In fact,
they may even be hindered in their ability to understand Python code, as the
English nature of its terms creates an additional layer of overhead. To that
end, we introduce the task of automatically translating Python's natural
modality (keywords, error types, identifiers, etc.) into other human languages.
This presents a unique challenge, considering the abbreviated nature of these
forms, as well as potential untranslatability of advanced
mathematical/programming concepts across languages. We therefore create an
automated pipeline to translate Python into other human languages, comparing
strategies using machine translation and large language models. We then use
this pipeline to acquire translations from five common Python libraries
(pytorch, pandas, tensorflow, numpy, and random) in seven languages, and do a
quality test on a subset of these terms in French, Greek, and Bengali. We hope
this will provide a clearer path forward towards creating a universal Python,
accessible to anyone regardless of nationality or language background.
comment: 15 pages, 4 figures, 17 tables
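A toy version of the keyword/builtin translation pass, using Python's tokenize module and a hypothetical French mapping; the paper's pipeline derives such mappings with MT and LLMs across seven languages, and the translated code is meant for readability rather than execution.

```python
# Replace a few English keywords/builtins with translated forms, token by token.
import io
import tokenize

# Hypothetical French mapping for a handful of terms (illustrative only).
FR = {"def": "définir", "return": "retourner", "print": "imprimer", "range": "intervalle"}

def translate_source(source: str, mapping: dict) -> str:
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string in mapping:
            tok = tok._replace(string=mapping[tok.string])
        tokens.append(tok)
    return tokenize.untokenize(tokens)

code = "def squares(n):\n    return [i * i for i in range(n)]\n"
print(translate_source(code, FR))
```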
☆ The Obvious Invisible Threat: LLM-Powered GUI Agents' Vulnerability to Fine-Print Injections
Chaoran Chen, Zhiping Zhang, Bingcan Guo, Shang Ma, Ibrahim Khalilov, Simret A Gebreegziabher, Yanfang Ye, Ziang Xiao, Yaxing Yao, Tianshi Li, Toby Jia-Jun Li
A Large Language Model (LLM) powered GUI agent is a specialized autonomous
system that performs tasks on the user's behalf according to high-level
instructions. It does so by perceiving and interpreting the graphical user
interfaces (GUIs) of relevant apps, often visually, inferring necessary
sequences of actions, and then interacting with GUIs by executing the actions
such as clicking, typing, and tapping. To complete real-world tasks, such as
filling forms or booking services, GUI agents often need to process and act on
sensitive user data. However, this autonomy introduces new privacy and security
risks. Adversaries can inject malicious content into the GUIs that alters agent
behaviors or induces unintended disclosures of private information. These
attacks often exploit the discrepancy between visual saliency for agents and
human users, or the agent's limited ability to detect violations of contextual
integrity in task automation. In this paper, we characterized six types of such
attacks, and conducted an experimental study to test these attacks with six
state-of-the-art GUI agents, 234 adversarial webpages, and 39 human
participants. Our findings suggest that GUI agents are highly vulnerable,
particularly to contextually embedded threats. Moreover, human users are also
susceptible to many of these attacks, indicating that simple human oversight
may not reliably prevent failures. This misalignment highlights the need for
privacy-aware agent design. We propose practical defense strategies to inform
the development of safer and more reliable GUI agents.
☆ From Misleading Queries to Accurate Answers: A Three-Stage Fine-Tuning Method for LLMs
Large language models (LLMs) exhibit excellent performance in natural
language processing (NLP), but remain highly sensitive to the quality of input
queries, especially when these queries contain misleading or inaccurate
information. Existing methods focus on correcting the output, but they often
overlook the potential of improving the ability of LLMs to detect and correct
misleading content in the input itself. In this paper, we propose a novel
three-stage fine-tuning method that enhances the ability of LLMs to detect and
correct misleading information in the input, further improving response
accuracy and reducing hallucinations. Specifically, the three stages include
(1) training LLMs to identify misleading information, (2) training LLMs to
correct the misleading information using built-in or external knowledge, and
(3) training LLMs to generate accurate answers based on the corrected queries.
To evaluate our method, we conducted experiments on three datasets for the
hallucination detection task and the question answering (QA) task, as well as
two datasets containing misleading information that we constructed. The
experimental results demonstrate that our method significantly improves the
accuracy and factuality of LLM responses, while also enhancing the ability to
detect hallucinations and reducing the generation of hallucinations in the
output, particularly when the query contains misleading information. We will
publicly release our code upon acceptance.
☆ UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
Recent advancements in Large Vision-Language Models are accelerating the
development of Graphical User Interface (GUI) agents that utilize human-like
vision perception capabilities to enhance productivity on digital devices.
Compared to approaches predicated on GUI metadata, which are platform-dependent
and vulnerable to implementation variations, vision-based approaches offer
broader applicability. In this vision-based paradigm, the GUI instruction
grounding, which maps user instruction to the location of corresponding element
on the given screenshot, remains a critical challenge, particularly due to
limited public training datasets and resource-intensive manual instruction data
annotation. In this paper, we delve into unexplored challenges in this task,
including element-to-screen ratio, unbalanced element types, and implicit
instructions. To address these challenges, we introduce UI-E2I-Synth, a
large-scale data synthesis pipeline for generating instruction datasets of
varying complexity using GPT-4o instead of human annotators. Furthermore, we propose a
new GUI instruction grounding benchmark UI-I2E-Bench, which is designed to
address the limitations of existing benchmarks by incorporating diverse
annotation aspects. Our model, trained on the synthesized data, achieves
superior performance in GUI instruction grounding, demonstrating the
advancements of proposed data synthesis pipeline. The proposed benchmark,
accompanied by extensive analyses, provides practical insights for future
research in GUI grounding. We will release corresponding artifacts at
https://colmon46.github.io/i2e-bench-leaderboard/
☆ Towards Automated Safety Requirements Derivation Using Agent-based RAG
Balahari Vignesh Balu, Florian Geissler, Francesco Carella, Joao-Vitor Zacchi, Josef Jiru, Nuria Mata, Reinhard Stolle
We study the automated derivation of safety requirements in a self-driving
vehicle use case, leveraging LLMs in combination with agent-based
retrieval-augmented generation. Conventional approaches that utilise
pre-trained LLMs to assist in safety analyses typically lack domain-specific
knowledge. Existing RAG approaches address this issue, yet their performance
deteriorates when handling complex queries, and it becomes increasingly harder
to retrieve the most relevant information. This is particularly problematic for
safety-critical applications. In this paper, we propose the use of agent-based
RAG to derive safety requirements and show that the retrieved information is
more relevant to the queries. We implement an agent-based approach on a
document pool of automotive standards and the Apollo case study, as a
representative example of an automated driving perception system. Our solution
is tested on a data set of safety requirement questions and answers, extracted
from the Apollo data. Evaluating a set of selected RAG metrics, we present and
discuss the advantages of an agent-based approach compared to default RAG methods.
comment: 9 pages, 3 figures
☆ Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs
Chang Yang, Ruiyu Wang, Junzhe Jiang, Qi Jiang, Qinggang Zhang, Yanchen Deng, Shuxin Li, Shuyue Hu, Bo Li, Florian T. Pokorny, Xiao Huang, Xinrun Wang
Reasoning is the fundamental capability of large language models (LLMs). Due
to the rapid progress of LLMs, current benchmarks suffer from two main issues:
i) they can be crushed in a short time (less than 1 year), and ii) they may be
easily hacked. To handle these issues, we propose ever-scaling benchmarks,
which are uncrushable, unhackable, auto-verifiable, and general. This paper
presents Nondeterministic
Polynomial-time Problem Challenge (NPPC), an ever-scaling reasoning benchmark
for LLMs. Specifically, the NPPC has three main modules: i) npgym, which
provides a unified interface of 25 well-known NP-complete problems and can
generate any number of instances at any level of complexity, ii) npsolver,
which provides a unified interface to evaluate the problem instances with both
online and offline models via APIs and local deployments, respectively, and
iii) npeval, which provides comprehensive, ready-to-use tools to analyze
the performance of LLMs over different problems, the number of tokens, the aha
moments, the reasoning errors and the solution errors. Extensive experiments
over widely-used LLMs demonstrate: i) NPPC can successfully decrease the
performance of advanced LLMs to below 10%, demonstrating that
NPPC is uncrushable, ii) DeepSeek-R1, Claude-3.7-Sonnet, and o1/o3-mini are the
most powerful LLMs, where DeepSeek-R1 outperforms Claude-3.7-Sonnet and
o1/o3-mini in most NP-complete problems considered, and iii) the numbers of
tokens and aha moments in advanced LLMs, e.g., Claude-3.7-Sonnet and
DeepSeek-R1, are observed to first increase and then decrease as the problem
instances become increasingly difficult. We believe that NPPC is the first
ever-scaling reasoning benchmark, serving as the uncrushable and unhackable
testbed for LLMs toward artificial general intelligence (AGI).
comment: Preliminary work, 10 pages for main text
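The "generate at any difficulty, verify cheaply" property that makes NP-complete problems attractive as an ever-scaling benchmark can be illustrated with Subset-Sum; npgym's actual interface and problem set may differ from this sketch.

```python
# Generate a satisfiable Subset-Sum instance of controllable size, and verify
# a proposed solution in polynomial time.
import random

def make_subset_sum_instance(n_items: int, seed: int = 0):
    rng = random.Random(seed)
    numbers = [rng.randint(1, 10**6) for _ in range(n_items)]
    # Plant a solution so every generated instance is satisfiable.
    chosen = rng.sample(range(n_items), k=max(1, n_items // 3))
    target = sum(numbers[i] for i in chosen)
    return numbers, target

def verify_subset_sum(numbers, target, proposed_indices) -> bool:
    # Verification is cheap even though search is (believed) exponential.
    return (len(set(proposed_indices)) == len(proposed_indices)
            and all(0 <= i < len(numbers) for i in proposed_indices)
            and sum(numbers[i] for i in proposed_indices) == target)

numbers, target = make_subset_sum_instance(n_items=12)
print(verify_subset_sum(numbers, target, proposed_indices=[0, 3, 5]))   # likely False
```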
☆ Enhancing multimodal analogical reasoning with Logic Augmented Generation
Recent advances in Large Language Models have demonstrated their capabilities
across a variety of tasks. However, automatically extracting implicit knowledge
from natural language remains a significant challenge, as machines lack active
experience with the physical world. Given this scenario, semantic knowledge
graphs can serve as conceptual spaces that guide the automated text generation
reasoning process to achieve more efficient and explainable results. In this
paper, we apply a logic-augmented generation (LAG) framework that leverages the
explicit representation of a text through a semantic knowledge graph and
applies it in combination with prompt heuristics to elicit implicit analogical
connections. This method generates extended knowledge graph triples
representing implicit meaning, enabling systems to reason on unlabeled
multimodal data regardless of the domain. We validate our work through three
metaphor detection and understanding tasks across four datasets, as they
require deep analogical reasoning capabilities. The results show that this
integrated approach surpasses current baselines, performs better than humans in
understanding visual metaphors, and enables more explainable reasoning
processes, though it still has inherent limitations in metaphor understanding,
especially for domain-specific metaphors. Furthermore, we propose a thorough
error analysis, discussing issues with metaphorical annotations and current
evaluation methods.
☆ Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items
Minjie Zou, Sahana Srinivasan, Thaddaeus Wai Soon Lo, Ke Zou, Gabriel Dawei Yang, Xuguang Ai, Hyunjae Kim, Maxwell Singer, Fares Antaki, Kelvin Li, Robert Chang, Marcus Tan, David Ziyou Chen, Dianbo Liu, Qingyu Chen, Yih Chung Tham
Recent advances in reasoning-focused large language models (LLMs) mark a
shift from general LLMs toward models designed for complex decision-making, a
crucial aspect in medicine. However, their performance in specialized domains
like ophthalmology remains underexplored. This study comprehensively evaluated
and compared the accuracy and reasoning capabilities of four newly developed
reasoning-focused LLMs, namely DeepSeek-R1, OpenAI o1, o3-mini, and Gemini 2.0
Flash-Thinking. Each model was assessed using 5,888 multiple-choice
ophthalmology exam questions from the MedMCQA dataset in a zero-shot setting.
Quantitative evaluation included accuracy, Macro-F1, and five text-generation
metrics (ROUGE-L, METEOR, BERTScore, BARTScore, and AlignScore), computed
against ground-truth reasonings. Average inference time was recorded for a
subset of 100 randomly selected questions. Additionally, two board-certified
ophthalmologists qualitatively assessed clarity, completeness, and reasoning
structure of responses to differential diagnosis questions. O1 (0.902) and
DeepSeek-R1 (0.888) achieved the highest accuracy, with o1 also leading in
Macro-F1 (0.900). The performance of models across the text-generation metrics
varied: O3-mini excelled in ROUGE-L (0.151), o1 in METEOR (0.232), DeepSeek-R1
and o3-mini tied for BERTScore (0.673), DeepSeek-R1 (-4.105) and Gemini 2.0
Flash-Thinking (-4.127) performed best in BARTScore, while o3-mini (0.181) and
o1 (0.176) led AlignScore. Inference time across the models varied, with
DeepSeek-R1 being slowest (40.4 seconds) and Gemini 2.0 Flash-Thinking fastest
(6.7 seconds). Qualitative evaluation revealed that DeepSeek-R1 and Gemini 2.0
Flash-Thinking tended to provide detailed and comprehensive intermediate
reasoning, whereas o1 and o3-mini displayed concise and summarized
justifications.
comment: 83 pages, 6 figures, 3 tables, 9 supplementary figures, 7
supplementary tables
☆ Bias Beyond English: Evaluating Social Bias and Debiasing Methods in a Low-Resource Setting
Social bias in language models can potentially exacerbate social
inequalities. Although this issue has garnered wide attention, most research focuses
on English data. In a low-resource scenario, the models often perform worse due
to insufficient training data. This study aims to leverage high-resource
language corpora to evaluate bias and experiment with debiasing methods in
low-resource languages. We evaluated the performance of recent multilingual
models in five languages: English (\textsc{eng}), Chinese (\textsc{zho}),
Russian (\textsc{rus}), Indonesian (\textsc{ind}) and Thai (\textsc{tha}), and
analyzed four bias dimensions: \textit{gender}, \textit{religion},
\textit{nationality}, and \textit{race-color}. By constructing multilingual
bias evaluation datasets, this study allows fair comparisons between models
across languages. We have further investigated three debiasing
methods-\texttt{CDA}, \texttt{Dropout}, \texttt{SenDeb}-and demonstrated that
debiasing methods from high-resource languages can be effectively transferred
to low-resource ones, providing actionable insights for fairness research in
multilingual NLP.
☆ MuSeD: A Multimodal Spanish Dataset for Sexism Detection in Social Media Videos
Laura De Grazia, Pol Pastells, Mauro Vázquez Chas, Desmond Elliott, Danae Sánchez Villegas, Mireia Farrús, Mariona Taulé
Sexism is generally defined as prejudice and discrimination based on sex or
gender, affecting every sector of society, from social institutions to
relationships and individual behavior. Social media platforms amplify the
impact of sexism by conveying discriminatory content not only through text but
also across multiple modalities, highlighting the critical need for a
multimodal approach to the analysis of sexism online. With the rise of social
media platforms where users share short videos, sexism is increasingly
spreading through video content. Automatically detecting sexism in videos is a
challenging task, as it requires analyzing the combination of verbal, audio,
and visual elements to identify sexist content. In this study, (1) we introduce
MuSeD, a new Multimodal Spanish dataset for Sexism Detection consisting of
$\approx$ 11 hours of videos extracted from TikTok and BitChute; (2) we propose
an innovative annotation framework for analyzing the contribution of textual
and multimodal labels in the classification of sexist and non-sexist content;
and (3) we evaluate a range of large language models (LLMs) and multimodal LLMs
on the task of sexism detection. We find that visual information plays a key
role in labeling sexist content for both humans and models. Models effectively
detect explicit sexism; however, they struggle with implicit cases, such as
stereotypes, instances where annotators also show low agreement. This
highlights the inherent difficulty of the task, as identifying implicit sexism
depends on the social and cultural context.
☆ Benchmarking Vision Language Models on German Factual Data
Similar to LLMs, the development of vision language models is mainly driven
by English datasets and models trained in English and Chinese, whereas
support for other languages, even those considered high-resource languages such
as German, remains significantly weaker. In this work, we present an analysis of
open-weight VLMs on factual knowledge in the German and English language. We
disentangle the image-related aspects from the textual ones by analyzing
accuracy with jury-as-a-judge in both prompt languages and images from German
and international contexts. We found that for celebrities and sights, VLMs
struggle because they lack visual cognition of German image contents.
For animals and plants, the tested models can often correctly identify the
image contents according to the scientific name or English common name but
fail in the German language. Cars and supermarket products were identified equally
well in English and German images across both prompt languages.
☆ Using LLMs as prompt modifier to avoid biases in AI image generators
This study examines how Large Language Models (LLMs) can reduce biases in
text-to-image generation systems by modifying user prompts. We define bias as a
model's unfair deviation from population statistics given neutral prompts. Our
experiments with Stable Diffusion XL, 3.5 and Flux demonstrate that
LLM-modified prompts significantly increase image diversity and reduce bias
without the need to change the image generators themselves. While occasionally
producing results that diverge from original user intent for elaborate prompts,
this approach generally provides more varied interpretations of underspecified
requests rather than superficial variations. The method works particularly well
for less advanced image generators, though limitations persist for certain
contexts like disability representation. All prompts and generated images are
available at https://iisys-hof.github.io/llm-prompt-img-gen/
☆ DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis
While multimodal fusion has been extensively studied in Multimodal Sentiment
Analysis (MSA), the role of fusion depth and multimodal capacity allocation
remains underexplored. In this work, we position fusion depth, scalability, and
dedicated multimodal capacity as primary factors for effective fusion. We
introduce DeepMLF, a novel multimodal language model (LM) with learnable tokens
tailored toward deep fusion. DeepMLF leverages an audiovisual encoder and a
pretrained decoder LM augmented with multimodal information across its layers.
We append learnable tokens to the LM that: 1) capture modality interactions in
a controlled fashion and 2) preserve independent information flow for each
modality. These fusion tokens gather linguistic information via causal
self-attention in LM Blocks and integrate with audiovisual information through
cross-attention MM Blocks. Serving as dedicated multimodal capacity, this
design enables progressive fusion across multiple layers, providing depth in
the fusion process. Our training recipe combines modality-specific losses and
language modelling loss, with the decoder LM tasked to predict ground truth
polarity. Across three MSA benchmarks with varying dataset characteristics,
DeepMLF achieves state-of-the-art performance. Our results confirm that deeper
fusion leads to better performance, with optimal fusion depths (5-7) exceeding
those of existing approaches. Additionally, our analysis on the number of
fusion tokens reveals that small token sets ($\sim$20) achieve optimal
performance. We examine the importance of representation learning order (fusion
curriculum) through audiovisual encoder initialization experiments. Our
ablation studies demonstrate the superiority of the proposed fusion design and
gating while providing a holistic examination of DeepMLF's scalability to LLMs,
and the impact of each training objective and embedding regularization.
comment: Preprint
☆ LazyReview: A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews
Peer review is a cornerstone of quality control in scientific publishing.
With the increasing workload, the unintended use of `quick' heuristics,
referred to as lazy thinking, has emerged as a recurring issue compromising
review quality. Automated methods to detect such heuristics can help improve
the peer-reviewing process. However, there is limited NLP research on this
issue, and no real-world dataset exists to support the development of detection
tools. This work introduces LazyReview, a dataset of peer-review sentences
annotated with fine-grained lazy thinking categories. Our analysis reveals that
Large Language Models (LLMs) struggle to detect these instances in a zero-shot
setting. However, instruction-based fine-tuning on our dataset significantly
boosts performance by 10-20 points, highlighting the importance of
high-quality training data. Furthermore, a controlled experiment demonstrates
that reviews revised with lazy thinking feedback are more comprehensive and
actionable than those written without such feedback. We will release our
dataset and the enhanced guidelines that can be used to train junior reviewers
in the community. (Code available here:
https://github.com/UKPLab/arxiv2025-lazy-review)
comment: 29 pages, 18 Figures, 15 Tables
☆ Dynamic Compressing Prompts for Efficient Inference of Large Language Models
Large Language Models (LLMs) have shown outstanding performance across a
variety of tasks, partly due to advanced prompting techniques. However, these
techniques often require lengthy prompts, which increase computational costs
and can hinder performance because of the limited context windows of LLMs.
While prompt compression is a straightforward solution, existing methods
confront the challenges of retaining essential information, adapting to context
changes, and remaining effective across different tasks. To tackle these
issues, we propose a task-agnostic method called Dynamic Compressing Prompts
(LLM-DCP). Our method reduces the number of prompt tokens while aiming to
preserve the performance as much as possible. We model prompt compression as a
Markov Decision Process (MDP), enabling the DCP-Agent to sequentially remove
redundant tokens by adapting to dynamic contexts and retaining crucial content.
We develop a reward function for training the DCP-Agent that balances the
compression rate, the quality of the LLM output, and the retention of key
information. This allows for prompt token reduction without needing an external
black-box LLM. Inspired by the progressive difficulty adjustment in curriculum
learning, we introduce a Hierarchical Prompt Compression (HPC) training
strategy that gradually increases the compression difficulty, enabling the
DCP-Agent to learn an effective compression method that maintains information
integrity. Experiments demonstrate that our method outperforms state-of-the-art
techniques, especially at higher compression rates. The code for our approach
will be available at https://github.com/Fhujinwu/DCP.
comment: Under review (submitted in 2024.11)
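A sketch of the kind of reward described above, trading the compression rate off against output quality and key-information retention; the weights and the two scoring inputs are assumptions, not the paper's reward.

```python
# Reward for prompt compression: higher when more tokens are removed, provided
# output quality and key-information recall stay high.
def compression_reward(orig_tokens, kept_tokens, output_quality, key_info_recall,
                       alpha=1.0, beta=2.0, gamma=1.0):
    compression_rate = 1.0 - len(kept_tokens) / max(1, len(orig_tokens))
    return alpha * compression_rate + beta * output_quality + gamma * key_info_recall

orig = "Please summarise the quarterly report focusing on revenue and churn".split()
kept = "summarise quarterly report revenue churn".split()
print(compression_reward(orig, kept, output_quality=0.92, key_info_recall=1.0))
```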
☆ ReZero: Enhancing LLM search ability by trying one-more-time
Retrieval-Augmented Generation (RAG) improves Large Language Model (LLM)
performance on knowledge-intensive tasks but depends heavily on initial search
query quality. Current methods, often using Reinforcement Learning (RL),
typically focus on query formulation or reasoning over results, without
explicitly encouraging persistence after a failed search. We introduce ReZero
(Retry-Zero), a novel RL framework that directly rewards the act of retrying a
search query following an initial unsuccessful attempt. This incentivizes the
LLM to explore alternative queries rather than prematurely halting. ReZero
demonstrates significant improvement, achieving 46.88% accuracy compared to a
25% baseline. By rewarding persistence, ReZero enhances LLM robustness in
complex information-seeking scenarios where initial queries may prove
insufficient.
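A hedged sketch of a retry-rewarding scheme in the spirit of ReZero: the usual correctness reward plus a small bonus whenever another search is issued after an unsuccessful one; the trajectory format and bonus value are illustrative assumptions.

```python
# Reward persistence: bonus each time a failed search is followed by a retry.
def rezero_style_reward(trajectory, answer_correct, retry_bonus=0.2):
    reward = 1.0 if answer_correct else 0.0
    for prev, curr in zip(trajectory, trajectory[1:]):
        if prev["action"] == "search" and not prev["hit"] and curr["action"] == "search":
            reward += retry_bonus   # reward retrying after an unsuccessful search
    return reward

traj = [{"action": "search", "hit": False},
        {"action": "search", "hit": True},
        {"action": "answer", "hit": None}]
print(rezero_style_reward(traj, answer_correct=True))   # 1.2
```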
☆ Exploring the Role of KG-Based RAG in Japanese Medical Question Answering with Small-Scale LLMs
Large language models (LLMs) perform well in medical QA, but their
effectiveness in Japanese contexts is limited due to privacy constraints that
prevent the use of commercial models like GPT-4 in clinical settings. As a
result, recent efforts focus on instruction-tuning open-source LLMs, though the
potential of combining them with retrieval-augmented generation (RAG) remains
underexplored. To bridge this gap, we are the first to explore a knowledge
graph-based (KG) RAG framework for Japanese medical QA with small-scale open-source
LLMs. Experimental results show that KG-based RAG has only a limited impact on
Japanese medical QA using small-scale open-source LLMs. Further case studies
reveal that the effectiveness of the RAG is sensitive to the quality and
relevance of the external retrieved content. These findings offer valuable
insights into the challenges and potential of applying RAG in Japanese medical
QA, while also serving as a reference for other low-resource languages.
comment: 10 pages
☆ Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From
The ability of cross-lingual context retrieval is a fundamental aspect of
cross-lingual alignment of large language models (LLMs), where the model
extracts context information in one language based on requests in another
language. Despite its importance in real-life applications, this ability has
not been adequately investigated for state-of-the-art models. In this paper, we
evaluate the cross-lingual context retrieval ability of over 40 LLMs across 12
languages to understand the source of this ability, using cross-lingual machine
reading comprehension (xMRC) as a representative scenario. Our results show
that several small, post-trained open LLMs show strong cross-lingual context
retrieval ability, comparable to closed-source LLMs such as GPT-4o, and their
estimated oracle performances greatly improve after post-training. Our
interpretability analysis shows that the cross-lingual context retrieval
process can be divided into two main phases: question encoding and answer
retrieval, which are formed in pre-training and post-training, respectively.
The stability of this phasing correlates with xMRC performance, and the xMRC bottleneck
lies at the last model layers in the second phase, where the effect of
post-training can be evidently observed. Our results also indicate that
larger-scale pretraining cannot improve the xMRC performance. Instead, larger
LLMs need further multilingual post-training to fully unlock their
cross-lingual context retrieval potential. Our code is available at
https://github.com/NJUNLP/Cross-Lingual-Context-Retrieval
☆ Efficient Reasoning Models: A Survey
Reasoning models have demonstrated remarkable progress in solving complex and
logic-intensive tasks by generating extended Chain-of-Thoughts (CoTs) prior to
arriving at a final answer. Yet, the emergence of this "slow-thinking"
paradigm, with numerous tokens generated in sequence, inevitably introduces
substantial computational overhead, highlighting an urgent need
for effective acceleration. This survey aims to provide a comprehensive
overview of recent advances in efficient reasoning. It categorizes existing
works into three key directions: (1) shorter - compressing lengthy CoTs into
concise yet effective reasoning chains; (2) smaller - developing compact
language models with strong reasoning capabilities through techniques such as
knowledge distillation, other model compression techniques, and reinforcement
learning; and (3) faster - designing efficient decoding strategies to
accelerate inference. A curated collection of papers discussed in this survey
is available in our GitHub repository.
☆ ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search
Yize Zhang, Tianshu Wang, Sirui Chen, Kun Wang, Xingyu Zeng, Hongyu Lin, Xianpei Han, Le Sun, Chaochao Lu
Large language models (LLMs) have demonstrated impressive capabilities and
are receiving increasing attention to enhance their reasoning through scaling
test-time compute. However, their application in open-ended,
knowledge-intensive, complex reasoning scenarios is still limited.
Reasoning-oriented methods struggle to generalize to open-ended scenarios due
to implicit assumptions of complete world knowledge. Meanwhile,
knowledge-augmented reasoning (KAR) methods fail to address two core
challenges: 1) error propagation, where errors in early steps cascade through
the chain, and 2) verification bottleneck, where the explore-exploit tradeoff
arises in multi-branch decision processes. To overcome these limitations, we
introduce ARise, a novel framework that integrates risk assessment of
intermediate reasoning states with dynamic retrieval-augmented generation
(RAG) within a Monte Carlo tree search paradigm. This approach enables
effective construction and optimization of reasoning plans across multiple
maintained hypothesis branches. Experimental results show that ARise
significantly outperforms the state-of-the-art KAR methods by up to 23.10%,
and the latest RAG-equipped large reasoning models by up to 25.37%.
comment: Project homepage: https://opencausalab.github.io/ARise
☆ Exploring Persona-dependent LLM Alignment for the Moral Machine Experiment ICLR 2025
Deploying large language models (LLMs) with agency in real-world applications
raises critical questions about how these models will behave. In particular,
how will their decisions align with humans when faced with moral dilemmas? This
study examines the alignment between LLM-driven decisions and human judgment in
various contexts of the moral machine experiment, including personas reflecting
different sociodemographics. We find that the moral decisions of LLMs vary
substantially by persona, showing greater shifts in moral decisions for
critical tasks than humans. Our data also indicate an interesting partisan
sorting phenomenon, where political personas predominantly determine the direction and
degree of LLM decisions. We discuss the ethical implications and risks
associated with deploying these models in applications that involve moral
decisions.
comment: Accepted to ICLR 2025 Workshop - BiAlign (Bidirectional Human-AI
Alignment)
☆ Ai2 Scholar QA: Organized Literature Synthesis with Attribution
Amanpreet Singh, Joseph Chee Chang, Chloe Anastasiades, Dany Haddad, Aakanksha Naik, Amber Tanaka, Angele Zamarron, Cecile Nguyen, Jena D. Hwang, Jason Dunkleberger, Matt Latzke, Smita Rao, Jaron Lochner, Rob Evans, Rodney Kinney, Daniel S. Weld, Doug Downey, Sergey Feldman
Retrieval-augmented generation is increasingly effective in answering
scientific questions from literature, but many state-of-the-art systems are
expensive and closed-source. We introduce Ai2 Scholar QA, a free online
scientific question answering application. To facilitate research, we make our
entire pipeline public: as a customizable open-source Python package and
interactive web app, along with paper indexes accessible through public APIs
and downloadable datasets. We describe our system in detail and present
experiments analyzing its key design decisions. In an evaluation on a recent
scientific QA benchmark, we find that Ai2 Scholar QA outperforms competing
systems.
comment: 7 pages
☆ Moving Beyond Next-Token Prediction: Transformers are Context-Sensitive Language Generators
Large Language Models (LLMs), powered by Transformers, have demonstrated
human-like intelligence capabilities, yet their underlying mechanisms remain
poorly understood. This paper presents a novel framework for interpreting LLMs
as probabilistic generators of left context-sensitive languages (CSLs). We
hypothesize that Transformers can be effectively decomposed into three
fundamental components: context windows, attention mechanisms, and
autoregressive generation frameworks. This decomposition allows for the
development of more flexible and interpretable computational models, moving
beyond the traditional view of attention and autoregression as inseparable
processes. We argue that next-token predictions can be understood as
probabilistic, dynamic approximations of left CSL production rules, providing
an intuitive explanation for how simple token predictions can yield human-like
intelligence outputs. Given that all CSLs are left context-sensitive
(Penttonen, 1974), we conclude that Transformers stochastically approximate
CSLs, which are widely recognized as models of human-like intelligence. This
interpretation bridges the gap between Formal Language Theory and the observed
generative power of Transformers, laying a foundation for future advancements
in generative AI theory and applications. Our novel perspective on Transformer
architectures will foster a deeper understanding of LLMs and their future
potential.
comment: 11 pages, 2 figures
☆ CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives
Navigating high-stakes dilemmas involving conflicting values is challenging
even for humans, let alone for AI. Yet prior work in evaluating the reasoning
capabilities of large language models (LLMs) in such situations has been
limited to everyday scenarios. To close this gap, this work first introduces
CLASH (Character perspective-based LLM Assessments in Situations with
High-stakes), a meticulously curated dataset consisting of 345 high-impact
dilemmas along with 3,795 individual perspectives of diverse values. In
particular, we design CLASH in a way to support the study of critical aspects
of value-based decision-making processes which are missing from prior work,
including understanding decision ambivalence and psychological discomfort as
well as capturing the temporal shifts of values in characters' perspectives. By
benchmarking 10 open and closed frontier models, we uncover several key
findings. (1) Even the strongest models, such as GPT-4o and Claude-Sonnet,
achieve less than 50% accuracy in identifying situations where the decision
should be ambivalent, while they perform significantly better in clear-cut
scenarios. (2) While LLMs reasonably predict psychological discomfort as marked
by humans, they inadequately comprehend perspectives involving value shifts,
indicating a need for LLMs to reason over complex values. (3) Our experiments
also reveal a significant correlation between LLMs' value preferences and their
steerability towards a given value. (4) Finally, LLMs exhibit greater
steerability when engaged in value reasoning from a third-party perspective,
compared to a first-person setup, though certain value pairs benefit uniquely
from the first-person framing.
☆ CSPLADE: Learned Sparse Retrieval with Causal Language Models
In recent years, dense retrieval has been the focus of information retrieval
(IR) research. While effective, dense retrieval produces uninterpretable dense
vectors, and suffers from the drawback of large index size. Learned sparse
retrieval (LSR) has emerged as a promising alternative, achieving competitive
retrieval performance while also being able to leverage the classical inverted
index data structure for efficient retrieval. However, little work has
explored scaling LSR beyond BERT scale. In this work, we identify two
challenges in training large language models (LLMs) for LSR: (1) training
instability during the early stage of contrastive training; (2) suboptimal
performance due to the pre-trained LLM's unidirectional attention. To address
these challenges, we propose two corresponding techniques: (1) a lightweight
adaptation training phase to eliminate training instability; (2) two model
variants enabling bidirectional information flow. With these techniques, we are
able to train LSR models with an 8B-scale LLM and achieve competitive retrieval
performance with reduced index size. Furthermore, we are among the first to
analyze the performance-efficiency tradeoff of LLM-based LSR models through the
lens of model quantization. Our findings provide insights into adapting LLMs
for efficient retrieval modeling.
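The efficiency argument above rests on the inverted index: with sparse term weights, scoring a query against a corpus reduces to a dot product over the few shared non-zero terms. A minimal illustrative sketch in Python (toy vocabulary, weights, and documents invented here; this is not the CSPLADE code):

    from collections import defaultdict

    # Toy sparse representations: term -> learned weight, with all zero-weight
    # terms omitted, as a SPLADE-style encoder would produce them.
    docs = {
        "d1": {"retrieval": 1.3, "sparse": 0.9, "index": 0.7},
        "d2": {"dense": 1.1, "vector": 0.8, "retrieval": 0.4},
    }

    # Inverted index: term -> list of (doc_id, weight).
    inverted_index = defaultdict(list)
    for doc_id, term_weights in docs.items():
        for term, weight in term_weights.items():
            inverted_index[term].append((doc_id, weight))

    def score(query_weights):
        """Rank documents by the dot product over shared non-zero terms."""
        scores = defaultdict(float)
        for term, q_weight in query_weights.items():
            for doc_id, d_weight in inverted_index.get(term, []):
                scores[doc_id] += q_weight * d_weight
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    print(score({"sparse": 1.0, "retrieval": 0.5}))  # d1 ranks above d2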
☆ Name of Thrones: Evaluating How LLMs Rank Student Names, Race, and Gender in Status Hierarchies
Across cultures, names tell a lot about their bearers as they carry deep
personal and cultural significance. Names also serve as powerful signals of
gender, race, and status in the social hierarchy - a pecking order in which
individual positions shape others' expectations on their perceived competence
and worth. With the widespread adoption of LLMs and as names are often an input
for LLMs, it is crucial to evaluate whether LLMs may sort people into status
positions based on first and last names and, if so, whether it is in an unfair,
biased fashion. While prior work has primarily investigated biases in first
names, little attention has been paid to last names and even less to the
combined effects of first and last names. In this study, we conduct a
large-scale analysis of name variations across 5 ethnicities to examine how AI
exhibits name biases. Our study investigates three key characteristics of
inequality and finds that LLMs reflect and reinforce status hierarchies based
on names that signal gender and ethnicity as they encode differential
expectations of competence, leadership, and economic potential. Contrary to the
common assumption that AI tends to favor Whites, we show that East and, in some
contexts, South Asian names receive higher rankings. We also disaggregate
Asians, a population projected to be the largest immigrant group in the U.S. by
2055. Our results challenge the monolithic Asian model minority assumption,
illustrating a more complex and stratified model of bias. Gender moderates
biases, with girls facing unfair disadvantages in certain racial groups.
Additionally, spanning cultural categories by adopting Western first names
improves AI-perceived status for East and Southeast Asian students,
particularly for girls. Our findings underscore the importance of
intersectional and more nuanced understandings of race, gender, and mixed
identities in the evaluation of LLMs.
☆ GUM-SAGE: A Novel Dataset and Approach for Graded Entity Salience Prediction
Determining and ranking the most salient entities in a text is critical for
user-facing systems, especially as users increasingly rely on models to
interpret long documents they only partially read. Graded entity salience
addresses this need by assigning entities scores that reflect their relative
importance in a text. Existing approaches fall into two main categories:
subjective judgments of salience, which allow for gradient scoring but lack
consistency, and summarization-based methods, which define salience as
mention-worthiness in a summary, promoting explainability but limiting outputs
to binary labels (entities are either summary-worthy or not). In this paper, we
introduce a novel approach for graded entity salience that combines the
strengths of both approaches. Using an English dataset spanning 12 spoken and
written genres, we collect 5 summaries per document and calculate each entity's
salience score based on its presence across these summaries. Our approach shows
stronger correlation with scores based on human summaries and alignments, and
outperforms existing techniques, including LLMs. We release our data and code
at https://github.com/jl908069/gum_sum_salience to support further research on
graded salient entity extraction.
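Read literally, the scoring idea above grades an entity by the fraction of the collected summaries that mention it. A small sketch of that computation (hypothetical string matching; the released repository is the authoritative implementation):

    def graded_salience(entity_aliases, summaries):
        """Score an entity by the fraction of summaries mentioning any of its aliases."""
        hits = sum(
            any(alias.lower() in summary.lower() for alias in entity_aliases)
            for summary in summaries
        )
        return hits / len(summaries)

    summaries = [
        "Marie Curie pioneered research on radioactivity.",
        "The article describes Curie's two Nobel Prizes.",
        "It reviews early work on radioactivity in Paris.",
        "Curie's laboratory notebooks are still radioactive.",
        "A short history of radiation science is given.",
    ]
    print(graded_salience(["Marie Curie", "Curie"], summaries))  # 0.6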
☆ The Art of Audience Engagement: LLM-Based Thin-Slicing of Scientific Talks
This paper examines the thin-slicing approach - the ability to make accurate
judgments based on minimal information - in the context of scientific
presentations. Drawing on research from nonverbal communication and personality
psychology, we show that brief excerpts (thin slices) reliably predict overall
presentation quality. Using a novel corpus of over one hundred real-life
science talks, we employ Large Language Models (LLMs) to evaluate transcripts
of full presentations and their thin slices. By correlating LLM-based
evaluations of short excerpts with full-talk assessments, we determine how much
information is needed for accurate predictions. Our results demonstrate that
LLM-based evaluations align closely with human ratings, proving their validity,
reliability, and efficiency. Critically, even very short excerpts (less than 10
percent of a talk) strongly predict overall evaluations. This suggests that the
first moments of a presentation convey relevant information that is used in
quality evaluations and can shape lasting impressions. The findings are robust
across different LLMs and prompting strategies. This work extends thin-slicing
research to public speaking and connects theories of impression formation to
LLMs and current research on AI communication. We discuss implications for
communication and social cognition research on message reception. Lastly, we
suggest an LLM-based thin-slicing framework as a scalable feedback tool to
enhance human communication.
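The headline analysis, how well excerpt ratings track full-talk ratings, boils down to a correlation between two lists of scores. A minimal sketch with made-up ratings (the paper's LLM evaluation prompts are not reproduced here):

    import numpy as np

    # Hypothetical quality ratings for the same six talks: one from the full
    # transcript, one from only the opening ~10% (the thin slice).
    full_talk_scores = np.array([8.5, 6.0, 7.2, 4.8, 9.1, 5.5])
    thin_slice_scores = np.array([8.0, 6.4, 7.0, 5.2, 8.8, 5.1])

    # Pearson correlation between thin-slice and full-talk evaluations.
    r = np.corrcoef(thin_slice_scores, full_talk_scores)[0, 1]
    print(f"thin-slice vs. full-talk correlation: r = {r:.2f}")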
♻ ☆ Graph Linearization Methods for Reasoning on Graphs with Large Language Models
Christos Xypolopoulos, Guokan Shang, Xiao Fei, Giannis Nikolentzos, Hadi Abdine, Iakovos Evdaimon, Michail Chatzianastasis, Giorgos Stamou, Michalis Vazirgiannis
Large language models have evolved to process multiple modalities beyond
text, such as images and audio, which motivates us to explore how to
effectively leverage them for graph reasoning tasks. The key question,
therefore, is how to transform graphs into linear sequences of tokens, a
process we term "graph linearization", so that LLMs can handle graphs
naturally. We consider that graphs should be linearized meaningfully to reflect
certain properties of natural language text, such as local dependency and
global alignment, in order to help contemporary LLMs, trained on trillions of
textual tokens, better understand graphs. To achieve this, we developed several
graph linearization methods based on graph centrality and degeneracy. These
methods are further enhanced using node relabeling techniques. The experimental
results demonstrate the effectiveness of our methods compared to the random
linearization baseline. Our work introduces novel graph representations
suitable for LLMs, contributing to the potential integration of graph machine
learning with the trend of multimodal processing using a unified transformer
model.
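As one way to picture a centrality-based linearization, the sketch below orders nodes by degree, relabels them by rank, and verbalizes each node's neighbourhood (an illustrative variant, not necessarily the authors' exact scheme):

    import networkx as nx

    def linearize_by_degree(graph: nx.Graph) -> str:
        """Order nodes by degree, relabel by rank, and verbalize the adjacency."""
        order = sorted(graph.nodes, key=lambda n: graph.degree(n), reverse=True)
        rank = {node: i for i, node in enumerate(order)}
        sentences = []
        for node in order:
            neighbours = sorted(rank[n] for n in graph.neighbors(node))
            listed = ", ".join(str(n) for n in neighbours)
            sentences.append(f"node {rank[node]} connects to {listed}.")
        return " ".join(sentences)

    g = nx.Graph([(0, 1), (0, 2), (0, 3), (2, 3)])
    print(linearize_by_degree(g))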
♻ ☆ Breaking the Data Barrier -- Building GUI Agents Through Task Generalization
Graphical User Interface (GUI) agents offer cross-platform solutions for
automating complex digital tasks, with significant potential to transform
productivity workflows. However, their performance is often constrained by the
scarcity of high-quality trajectory data. To address this limitation, we
propose training Vision Language Models (VLMs) on data-rich,
reasoning-intensive tasks during a dedicated mid-training stage, and then
examine how incorporating these tasks facilitates generalization to GUI
planning scenarios. Specifically, we explore a range of tasks with readily
available instruction-tuning data, including GUI perception, multimodal
reasoning, and textual reasoning. Through extensive experiments across 11
mid-training tasks, we demonstrate that: (1) Task generalization proves highly
effective, yielding substantial improvements across most settings. For
instance, multimodal mathematical reasoning enhances performance on
AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data
significantly boosts GUI web agent performance, achieving a 5.6% improvement on
WebArena and 5.4% improvement on AndroidWorld, underscoring notable cross-modal
generalization from text-based to visual domains; (2) Contrary to prior
assumptions, GUI perception data - previously considered closely aligned with
GUI agent tasks and widely utilized for training - has a comparatively limited
impact on final performance; (3) Building on these insights, we identify the
most effective mid-training tasks and curate optimized mixture datasets,
resulting in absolute performance gains of 8.0% on WebArena and 12.2% on
AndroidWorld. Our work provides valuable insights into cross-domain knowledge
transfer for GUI agents and offers a practical approach to addressing data
scarcity challenges in this emerging field. The code, data and models will be
available at https://github.com/hkust-nlp/GUIMid.
comment: 24 pages, 11 figures
♻ ☆ Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering
Misleading chart visualizations, which intentionally manipulate data
representations to support specific claims, can distort perceptions and lead to
incorrect conclusions. Despite decades of research, misleading visualizations
remain a widespread and pressing issue. Recent advances in multimodal large
language models (MLLMs) have demonstrated strong chart comprehension
capabilities, yet no existing work has systematically evaluated their ability
to detect and interpret misleading charts. This paper introduces the Misleading
Chart Question Answering (Misleading ChartQA) Benchmark, a large-scale
multimodal dataset designed to assess MLLMs in identifying and reasoning about
misleading charts. It contains over 3,000 curated examples, covering 21 types
of misleaders and 10 chart types. Each example includes standardized chart
code, CSV data, and multiple-choice questions with labeled explanations,
validated through multi-round MLLM checks and exhaustive expert human review. We
benchmark 16 state-of-the-art MLLMs on our dataset, revealing their limitations
in identifying visually deceptive practices. We also propose a novel pipeline
that detects and localizes misleaders, enhancing MLLMs' accuracy in misleading
chart interpretation. Our work establishes a foundation for advancing
MLLM-driven misleading chart comprehension. We publicly release the sample
dataset to support further research in this critical area.
comment: 31 pages in total. Under Review
♻ ☆ Automatic Item Generation for Personality Situational Judgment Tests with Large Language Models
Personality assessment, particularly through situational judgment tests
(SJTs), is a vital tool for psychological research, talent selection, and
educational evaluation. This study explores the potential of GPT-4, a
state-of-the-art large language model (LLM), to automate the generation of
personality situational judgment tests (PSJTs) in Chinese. Traditional SJT
development is labor-intensive and prone to biases, while GPT-4 offers a
scalable, efficient alternative. Two studies were conducted: Study 1 evaluated
the impact of prompt design and temperature settings on content validity,
finding that optimized prompts with a temperature of 1.0 produced creative and
accurate items. Study 2 assessed the psychometric properties of GPT-4-generated
PSJTs, revealing that they demonstrated satisfactory reliability and validity,
surpassing the performance of manually developed tests in measuring the Big
Five personality traits. This research highlights GPT-4's effectiveness in
developing high-quality PSJTs, providing a scalable and innovative method for
psychometric test development. These findings expand the possibilities of
automatic item generation and the application of LLMs in psychology, and offer
practical implications for streamlining test development processes in
resource-limited settings.
comment: Submitted to Computers in Human Behavior Reports. 54 pages (main
text), 12 pages (appendix), and 5 figures
♻ ☆ Lateral Phishing With Large Language Models: A Large Organization Comparative Study
Mazal Bethany, Athanasios Galiopoulos, Emet Bethany, Mohammad Bahrami Karkevandi, Nicole Beebe, Nishant Vishwamitra, Peyman Najafirad
The emergence of Large Language Models (LLMs) has heightened the threat of
phishing emails by enabling the generation of highly targeted, personalized,
and automated attacks. Traditionally, many phishing emails have been
characterized by typos, errors, and poor language. These errors can be
mitigated by LLMs, potentially lowering the barrier for attackers. Despite
this, there is a lack of large-scale studies comparing the effectiveness of
LLM-generated lateral phishing emails to those crafted by humans. Current
literature does not adequately address the comparative effectiveness of LLM and
human-generated lateral phishing emails in a real-world, large-scale
organizational setting, especially considering the potential for LLMs to
generate more convincing and error-free phishing content. To address this gap,
we conducted a pioneering study within a large university, targeting its
workforce of approximately 9,000 individuals including faculty, staff,
administrators, and student workers. Our results indicate that LLM-generated
lateral phishing emails are as effective as those written by communications
professionals, emphasizing the critical threat posed by LLMs in leading
phishing campaigns. We break down the results of the overall phishing
experiment, comparing vulnerability between departments and job roles.
Furthermore, to gather qualitative data, we administered a detailed
questionnaire, revealing insights into the reasons and motivations behind
vulnerable employees' actions. This study contributes to the understanding of
cyber security threats in educational institutions and provides a comprehensive
comparison of LLM and human-generated phishing emails' effectiveness,
considering the potential for LLMs to generate more convincing content. The
findings highlight the need for enhanced user education and system defenses to
mitigate the growing threat of AI-powered phishing attacks.
comment: Accepted for publication in IEEE Access. This version includes
revisions following peer review
♻ ☆ CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models
Xuechen Liang, Yangfan He, Meiling Tao, Yinghui Xia, Jianhui Wang, Tianyu Shi, Jun Wang, JingSong Yang
Open large language models (LLMs) have significantly advanced the field of
natural language processing, showcasing impressive performance across various
tasks. Despite these significant advancements, their effective operation
still relies heavily on human input to accurately guide the dialogue flow, with
agent tuning being a crucial optimization technique that involves human
adjustments to the model for better responses to such guidance. Addressing this
dependency, our work introduces the TinyAgent model, trained on a meticulously
curated high-quality dataset. We also present the Collaborative Multi-Agent
Tuning (CMAT) framework, an innovative system designed to augment language
agent capabilities through adaptive weight updates based on environmental
feedback. This framework fosters collaborative learning and real-time
adaptation among multiple intelligent agents, enhancing their context-awareness
and long-term memory. In this research, we propose a new communication agent
framework that integrates multi-agent systems with environmental feedback
mechanisms, offering a scalable method to explore cooperative behaviors.
Notably, our TinyAgent-7B model exhibits performance on par with GPT-3.5,
despite having fewer parameters, signifying a substantial improvement in the
efficiency and effectiveness of LLMs.
♻ ☆ Enhancing Commentary Strategies for Imperfect Information Card Games: A Study of Large Language Models in Guandan Commentary
Recent advancements in large language models (LLMs) have unlocked the
potential for generating high-quality game commentary. However, producing
insightful and engaging commentary for complex games with incomplete
information remains a significant challenge. In this paper, we introduce a
novel commentary method that combines Reinforcement Learning (RL) and LLMs,
tailored specifically for the Chinese card game \textit{Guandan}. Our system
leverages RL to generate intricate card-playing scenarios and employs LLMs to
generate corresponding commentary text, effectively emulating the strategic
analysis and narrative prowess of professional commentators. The framework
comprises a state commentary guide, a Theory of Mind (ToM)-based strategy
analyzer, and a style retrieval module, which seamlessly collaborate to deliver
detailed and context-relevant game commentary in the Chinese language
environment. We empower LLMs with ToM capabilities and refine both retrieval
and information filtering mechanisms. This facilitates the generation of
personalized commentary content. Our experimental results showcase the
substantial enhancement in performance achieved by the proposed commentary
framework when applied to open-source LLMs, surpassing the performance of GPT-4
across multiple evaluation metrics.
♻ ☆ MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages
We present MultiLoKo, a new benchmark for evaluating multilinguality in LLMs
covering 31 languages. MultiLoKo consists of three partitions: a main partition
consisting of 500 questions per language, separately sourced to be locally
relevant to the specific language, and two translated partitions, containing
human-authored translations from 30 non-English languages to English and vice
versa. For comparison, we also release corresponding machine-authored
translations. The data is equally distributed over two splits: a dev split and
a blind, out-of-distribution test split. MultiLoKo can be used to study a
variety of questions regarding the multilinguality of LLMs as well as
meta-questions about multilingual benchmark creation. We compute MultiLoKo
scores for 11 base and chat models marketed to be multilingual and study their
average performance, their performance parity across languages, how much their
ability to answer questions depends on the question language, and which
languages are most difficult. None of the models we studied performs well on
MultiLoKo, as indicated by low average scores as well as large differences
between the best and worst scoring languages. Furthermore, we find a
substantial effect of the question language, indicating sub-optimal knowledge
transfer between languages. Lastly, we find that using local vs.
English-translated data can result in differences of more than 20 points for the
best-performing models, drastically changing the estimated difficulty of some
languages. When machine translations are used instead of human ones, we find a weaker
effect on ordering of language difficulty, a larger difference in model
rankings, and a substantial drop in estimated performance for all models.
♻ ☆ GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
Existing efforts in building Graphical User Interface (GUI) agents largely
rely on the training paradigm of supervised fine-tuning on Large
Vision-Language Models (LVLMs). However, this approach not only demands
extensive amounts of training data but also struggles to effectively understand
GUI screenshots and generalize to unseen interfaces. The issue significantly
limits its application in real-world scenarios, especially for high-level
tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models
(e.g., DeepSeek-R1), which efficiently enhances the problem-solving
capabilities of large language models in real-world settings, we propose GUI-R1,
the first reinforcement learning framework designed to enhance the GUI
capabilities of LVLMs in high-level real-world task scenarios, through unified
action space rule modeling. By leveraging a small amount of carefully curated
high-quality data across multiple platforms (including Windows, Linux, MacOS,
Android, and Web) and employing policy optimization algorithms such as Group
Relative Policy Optimization (GRPO) to update the model, GUI-R1 achieves
superior performance using only 0.02% of the data (3K vs. 13M) compared to
previous state-of-the-art methods like OS-Atlas across eight benchmarks
spanning three different platforms (mobile, desktop, and web). These results
demonstrate the immense potential of reinforcement learning based on unified
action space rule modeling in improving the execution capabilities of LVLMs for
real-world GUI agent tasks.
♻ ☆ SafeChat: A Framework for Building Trustworthy Collaborative Assistants and a Case Study of its Usefulness
Collaborative assistants, or chatbots, are data-driven decision support
systems that enable natural interaction for task completion. While they can
meet critical needs in modern society, concerns about their reliability and
trustworthiness persist. In particular, Large Language Model (LLM)-based
chatbots like ChatGPT, Gemini, and DeepSeek are becoming more accessible.
However, such chatbots have limitations, including their inability to explain
response generation, the risk of generating problematic content, the lack of
standardized testing for reliability, and the need for deep AI expertise and
extended development times. These issues make chatbots unsuitable for
trust-sensitive applications like elections or healthcare. To address these
concerns, we introduce SafeChat, a general architecture for building safe and
trustworthy chatbots, with a focus on information retrieval use cases. Key
features of SafeChat include: (a) safety, with a domain-agnostic design where
responses are grounded and traceable to approved sources (provenance), and
'do-not-respond' strategies to prevent harmful answers; (b) usability, with
automatic extractive summarization of long responses, traceable to their
sources, and automated trust assessments to communicate expected chatbot
behavior, such as sentiment; and (c) fast, scalable development, including a
CSV-driven workflow, automated testing, and integration with various devices.
We implemented SafeChat in an executable framework using the open-source
chatbot platform Rasa. A case study demonstrates its application in building
ElectionBot-SC, a chatbot designed to safely disseminate official election
information. SafeChat is being used in many domains, validating its potential,
and is available at: https://github.com/ai4society/trustworthy-chatbot.
♻ ☆ Towards Predictive Communication with Brain-Computer Interfaces integrating Large Language Models
This perspective article aims at providing an outline of the state of the art
and future developments towards the integration of cutting-edge predictive
language models with BCI. A synthetic overview of early and more recent
linguistic models, from natural language processing (NLP) models to recent LLM,
that to a varying extent improved predictive writing systems, is first
provided. Second, a summary of previous BCI implementations integrating
language models is presented. The few preliminary studies investigating the
possible combination of LLM with BCI spellers to efficiently support fast
communication and control are then described. Finally, current challenges and
limitations towards the full integration of LLM with BCI systems are discussed.
Recent investigations suggest that the combination of LLM with BCI might
drastically improve human-computer interaction in patients with motor or
language disorders as well as in healthy individuals. In particular, the
pretrained autoregressive transformer models, such as GPT, which capitalize on
parallelization and on learning through pre-training and fine-tuning, promise a
substantial improvement of BCI-based communication with respect to previous
systems incorporating simpler language models. Indeed, among various models,
GPT-2 was shown to be an excellent candidate for integration
into BCI, although testing was only performed on simulated conversations and not
on real BCI scenarios. Prospectively, the full integration of LLM with advanced
BCI systems might lead to a big leap forward towards fast, efficient and
user-adaptive neurotechnology.
comment: needs major revision
♻ ☆ Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models
NVIDIA, :, Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Sunil Mahabaleshwarkar, Andrew Tao, Anna Shors, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Bobby Chen, Boris Ginsburg, Boxin Wang, Brandon Norick, Brian Butterfield, Bryan Catanzaro, Carlo del Mundo, Chengyu Dong, Christine Harvey, Christopher Parisien, Dan Su, Daniel Korzekwa, Danny Yin, Daria Gitman, David Mosallanezhad, Deepak Narayanan, Denys Fridman, Dima Rekesh, Ding Ma, Dmytro Pykhtar, Dong Ahn, Duncan Riach, Dusan Stosic, Eileen Long, Elad Segal, Ellie Evans, Eric Chung, Erick Galinkin, Evelina Bakhturina, Ewa Dobrowolska, Fei Jia, Fuxiao Liu, Gargi Prasad, Gerald Shen, Guilin Liu, Guo Chen, Haifeng Qian, Helen Ngo, Hongbin Liu, Hui Li, Igor Gitman, Ilia Karmanov, Ivan Moshkov, Izik Golan, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jarno Seppanen, Jason Lu, Jason Sewall, Jiaqi Zeng, Jiaxuan You, Jimmy Zhang, Jing Zhang, Jining Huang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jon Barker, Jonathan Cohen, Joseph Jennings, Jupinder Parmar, Karan Sapra, Kari Briski, Kateryna Chumachenko, Katherine Luna, Keshav Santhanam, Kezhi Kong, Kirthi Sivamani, Krzysztof Pawelec, Kumar Anik, Kunlun Li, Lawrence McAfee, Leon Derczynski, Lindsey Pavao, Luis Vega, Lukas Voegtle, Maciej Bala, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski, Markus Kliegl, Marta Stepniewska-Dziubinska, Matthieu Le, Matvei Novikov, Mehrzad Samadi, Michael Andersch, Michael Evans, Miguel Martinez, Mike Chrzanowski, Mike Ranzinger, Mikolaj Blaz, Misha Smelyanskiy, Mohamed Fawzy, Mohammad Shoeybi, Mostofa Patwary, Nayeon Lee, Nima Tajbakhsh, Ning Xu, Oleg Rybakov, Oleksii Kuchaiev, Olivier Delalleau, Osvald Nitski, Parth Chadha, Pasha Shamis, Paulius Micikevicius, Pavlo Molchanov, Peter Dykas, Philipp Fischer, Pierre-Yves Aquilanti, Piotr Bialecki, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Rabeeh Karimi, Rahul Kandu, Ran El-Yaniv, Raviraj Joshi, Roger Waleffe, Ruoxi Zhang, Sabrina Kavanaugh, Sahil Jain, Samuel Kriman, Sangkug Lym, Sanjeev Satheesh, Saurav Muralidharan, Sean Narenthiran, Selvaraj Anandaraj, Seonmyeong Bak, Sergey Kashirsky, Seungju Han, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere Sreenivas, Sharon Clay, Shelby Thomas, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shyamala Prayaga, Siddhartha Jain, Sirshak Das, Slawek Kierat, Somshubra Majumdar, Song Han, Soumye Singhal, Sriharsha Niverty, Stefania Alborghetti, Suseella Panguluri, Swetha Bhendigeri, Syeda Nahida Akter, Szymon Migacz, Tal Shiri, Terry Kong, Timo Roman, Tomer Ronen, Trisha Saar, Tugrul Konuk, Tuomas Rintamaki, Tyler Poon, Ushnish De, Vahid Noroozi, Varun Singh, Vijay Korthikanti, Vitaly Kurin, Wasi Uddin Ahmad, Wei Du, Wei Ping, Wenliang Dai, Wonmin Byeon, Xiaowei Ren, Yao Xu, Yejin Choi, Yian Zhang, Ying Lin, Yoshi Suhara, Zhiding Yu, Zhiqi Li, Zhiyu Li, Zhongbo Zhu, Zhuolin Yang, Zijia Chen
As inference-time scaling becomes critical for enhanced reasoning
capabilities, it is becoming increasingly important to build models that are
efficient to infer. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid
Mamba-Transformer models designed to reduce inference cost for a given accuracy
level. To achieve this goal, we replace the majority of self-attention layers
in the common Transformer model architecture with Mamba layers that perform
constant computation and require constant memory per generated token. We show
that Nemotron-H models offer either better or on-par accuracy compared to other
similarly-sized state-of-the-art open-sourced Transformer models (e.g.,
Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3$\times$ faster at
inference. To further increase inference speed and reduce the memory required
at inference time, we created Nemotron-H-47B-Base from the 56B model using a
new compression via pruning and distillation technique called MiniPuzzle.
Nemotron-H-47B-Base achieves similar accuracy to the 56B model, but is 20%
faster to infer. In addition, we introduce an FP8-based training recipe and
show that it can achieve on par results with BF16-based training. This recipe
is used to train the 56B model. We are releasing Nemotron-H base model
checkpoints with support in Hugging Face and NeMo.
♻ ☆ Retro-Search: Exploring Untaken Paths for Deeper and Efficient Reasoning
Ximing Lu, Seungju Han, David Acuna, Hyunwoo Kim, Jaehun Jung, Shrimai Prabhumoye, Niklas Muennighoff, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi
Large reasoning models exhibit remarkable reasoning capabilities via long,
elaborate reasoning trajectories. Supervised fine-tuning on such reasoning
traces, also known as distillation, can be a cost-effective way to boost
reasoning capabilities of student models. However, empirical observations
reveal that these reasoning trajectories are often suboptimal, switching
excessively between different lines of thought, resulting in under-thinking,
over-thinking, and even degenerate responses. We introduce Retro-Search, an
MCTS-inspired search algorithm, for distilling higher quality reasoning paths
from large reasoning models. Retro-Search retrospectively revises reasoning
paths to discover better, yet shorter traces, which can then lead to student
models with enhanced reasoning capabilities with shorter, thus faster
inference. Our approach can enable two use cases: self-improvement, where
models are fine-tuned on their own Retro-Search-ed thought traces, and
weak-to-strong improvement, where a weaker model revises a stronger model's
thought traces via Retro-Search. For self-improvement, R1-distill-7B, fine-tuned
on its own Retro-Search-ed traces, reduces the average reasoning length by
31.2% while improving performance by 7.7% across seven math benchmarks. For
weak-to-strong improvement, we retrospectively revise R1-671B's traces from the
OpenThoughts dataset using R1-distill-32B as the Retro-Search-er, a model 20x
smaller. Qwen2.5-32B, fine-tuned on this refined data, achieves performance
comparable to R1-distill-32B, yielding an 11.3% reduction in reasoning length
and a 2.4% performance improvement compared to fine-tuning on the original
OpenThoughts data. Our work counters recently emergent viewpoints that question
the relevance of search algorithms in the era of large reasoning models, by
demonstrating that there are still opportunities for algorithmic advancements,
even for frontier models.
comment: Code and data will be publicly released upon internal approval
♻ ☆ Analyzing 16,193 LLM Papers for Fun and Profits
Large Language Models (LLMs) are reshaping the landscape of computer science
research, driving significant shifts in research priorities across diverse
conferences and fields. This study provides a comprehensive analysis of the
publication trend of LLM-related papers in 77 top-tier computer science
conferences over the past six years (2019-2024). We approach this analysis from
four distinct perspectives: (1) We investigate how LLM research is driving
topic shifts within major conferences. (2) We adopt a topic modeling approach
to identify various areas of LLM-related topic growth and reveal the topics of
concern at different conferences. (3) We explore distinct contribution patterns
of academic and industrial institutions. (4) We study the influence of national
origins on LLM development trajectories. Synthesizing the findings from these
diverse analytical angles, we derive ten key insights that illuminate the
dynamics and evolution of the LLM research ecosystem.
♻ ☆ What is the Role of Small Models in the LLM Era: A Survey
Large Language Models (LLMs) have made significant progress in advancing
artificial general intelligence (AGI), leading to the development of
increasingly large models such as GPT-4 and LLaMA-405B. However, scaling up
model sizes results in exponentially higher computational costs and energy
consumption, making these models impractical for academic researchers and
businesses with limited resources. At the same time, Small Models (SMs) are
frequently used in practical settings, although their significance is currently
underestimated. This raises important questions about the role of small models
in the era of LLMs, a topic that has received limited attention in prior
research. In this work, we systematically examine the relationship between LLMs
and SMs from two key perspectives: Collaboration and Competition. We hope this
survey provides valuable insights for practitioners, fostering a deeper
understanding of the contribution of small models and promoting more efficient
use of computational resources. The code is available at
https://github.com/tigerchen52/role_of_small_models
comment: a survey paper of small models
♻ ☆ VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge
Current multimodal benchmarks often conflate reasoning with domain-specific
knowledge, making it difficult to isolate and evaluate general reasoning
abilities in non-expert settings. To address this, we introduce VisualPuzzles,
a benchmark that targets visual reasoning while deliberately minimizing
reliance on specialized knowledge. VisualPuzzles consists of diverse questions
spanning five categories: algorithmic, analogical, deductive, inductive, and
spatial reasoning. One major source of our questions is manually translated
logical reasoning questions from the Chinese Civil Service Examination.
Experiments show that VisualPuzzles requires significantly less intensive
domain-specific knowledge and more complex reasoning compared to benchmarks
like MMMU, enabling us to better evaluate genuine multimodal reasoning.
Evaluations show that state-of-the-art multimodal large language models
consistently lag behind human performance on VisualPuzzles, and that strong
performance on knowledge-intensive benchmarks does not necessarily translate to
success on reasoning-focused, knowledge-light tasks. Additionally, reasoning
enhancements such as scaling up inference compute (with "thinking" modes) yield
inconsistent gains across models and task types, and we observe no clear
correlation between model size and performance. We also found that models
exhibit different reasoning and answering patterns on VisualPuzzles compared to
benchmarks with heavier emphasis on knowledge. VisualPuzzles offers a clearer
lens through which to evaluate reasoning capabilities beyond factual recall and
domain knowledge.
comment: 56 pages, 43 figures
♻ ☆ What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness
Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Boqiang Zhang, Nianzu Yang, Pandeng Li, Yinglu Li, Zuan Gao, Yun Zheng, Hongtao Xie
Visual captioning benchmarks have become outdated with the emergence of
modern multimodal large language models (MLLMs), as the brief ground-truth
sentences and traditional metrics fail to assess detailed captions effectively.
While recent benchmarks attempt to address this by focusing on keyword
extraction or object-centric evaluation, they remain limited to vague-view or
object-view analyses and incomplete visual element coverage. In this paper, we
introduce CAPability, a comprehensive multi-view benchmark for evaluating
visual captioning across 12 dimensions spanning six critical views. We curate
nearly 11K human-annotated images and videos with visual element annotations to
evaluate the generated captions. CAPability stably assesses both the
correctness and thoroughness of captions using F1-score. By converting
annotations to QA pairs, we further introduce a heuristic metric, \textit{know
but cannot tell} ($K\bar{T}$), indicating a significant performance gap between
QA and caption capabilities. Our work provides the first holistic analysis of
MLLMs' captioning abilities, as we identify their strengths and weaknesses
across various dimensions, guiding future research to enhance specific aspects
of capabilities.
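Reading correctness as precision over the visual elements a caption asserts and thoroughness as recall over the annotated elements, the F1-style aggregation can be sketched as follows (a schematic reading of the abstract, not the benchmark's exact protocol):

    def caption_f1(predicted_elements, gold_elements):
        """F1 between elements asserted by a caption and annotated ground truth."""
        predicted, gold = set(predicted_elements), set(gold_elements)
        if not predicted or not gold:
            return 0.0
        true_positives = len(predicted & gold)
        if true_positives == 0:
            return 0.0
        precision = true_positives / len(predicted)  # correctness
        recall = true_positives / len(gold)          # thoroughness
        return 2 * precision * recall / (precision + recall)

    gold = {"red car", "wet road", "traffic light", "pedestrian"}
    caption = {"red car", "wet road", "bicycle"}
    print(round(caption_f1(caption, gold), 3))  # 0.571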
♻ ☆ Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding
In recent years, text-to-image (T2I) generation models have made significant
progress in generating high-quality images that align with text descriptions.
However, these models also face the risk of unsafe generation, potentially
producing harmful content that violates usage policies, such as explicit
material. Existing safe generation methods typically focus on suppressing
inappropriate content by erasing undesired concepts from visual
representations, while neglecting to sanitize the textual representation.
Although these methods help mitigate the risk of misuse to some extent, their
robustness remains insufficient when dealing with adversarial attacks.
Given that semantic consistency between input text and output image is a core
requirement of T2I models, we identify that textual representations are likely
the primary source of unsafe generation. To this end, we propose Embedding
Sanitizer (ES), which enhances the safety of T2I models by sanitizing
inappropriate concepts in prompt embeddings. To our knowledge, ES is the first
interpretable safe generation framework that assigns a score to each token in
the prompt to indicate its potential harmfulness. In addition, ES adopts a
plug-and-play modular design, offering compatibility for seamless integration
with various T2I models and other safeguards. Evaluations on five prompt
benchmarks show that ES outperforms eleven existing safeguard baselines,
achieving state-of-the-art robustness while maintaining high-quality image
generation.
♻ ☆ Can you map it to English? The Role of Cross-Lingual Alignment in Multilingual Performance of LLMs
Large language models (LLMs) pre-trained predominantly on English text
exhibit surprising multilingual capabilities, yet the mechanisms driving
cross-lingual generalization remain poorly understood. This work investigates
how the alignment of representations for text written in different languages
correlates with LLM performance on natural language understanding tasks and
translation tasks, both at the language and the instance level. For this
purpose, we introduce cross-lingual alignment metrics such as the
Discriminative Alignment Index (DALI) to quantify the alignment at an instance
level for discriminative tasks. Through experiments on three natural language
understanding tasks (Belebele, XStoryCloze, XCOPA), and machine translation, we
find that while cross-lingual alignment metrics strongly correlate with task
accuracy at the language level, the sample-level alignment often fails to
distinguish correct from incorrect predictions, exposing alignment as a
necessary but insufficient condition for success.
♻ ☆ Unchecked and Overlooked: Addressing the Checkbox Blind Spot in Large Language Models with CheckboxQA
Checkboxes are critical in real-world document processing where the presence
or absence of ticks directly informs data extraction and decision-making
processes. Yet, despite the strong performance of Large Vision and Language
Models across a wide range of tasks, they struggle with interpreting checkable
content. This challenge becomes particularly pressing in industries where a
single overlooked checkbox may lead to costly regulatory or contractual
oversights. To address this gap, we introduce the CheckboxQA dataset, a
targeted resource designed to evaluate and improve model performance on
checkbox-related tasks. It reveals the limitations of current models and serves
as a valuable tool for advancing document comprehension systems, with
significant implications for applications in sectors such as legal tech and
finance.
The dataset is publicly available at:
https://github.com/Snowflake-Labs/CheckboxQA
♻ ☆ Causal Graphical Models for Vision-Language Compositional Understanding ICLR 2025
Recent work has empirically shown that Vision-Language Models (VLMs) struggle
to fully understand the compositional properties of the human language, usually
modeling an image caption as a "bag of words". As a result, they perform poorly
on compositional tasks, which require a deeper understanding of the different
entities of a sentence (subject, verb, etc.) jointly with their mutual
relationships in order to be solved. In this paper, we model the dependency
relations among textual and visual tokens using a Causal Graphical Model (CGM),
built using a dependency parser, and we train a decoder conditioned by the VLM
visual encoder. Differently from standard autoregressive or parallel
predictions, our decoder's generative process is partially-ordered following
the CGM structure. This structure encourages the decoder to learn only the main
causal dependencies in a sentence discarding spurious correlations. Using
extensive experiments on five compositional benchmarks, we show that our method
significantly outperforms all the state-of-the-art compositional approaches by
a large margin, and it also improves over methods trained using much larger
datasets.
comment: Accepted at ICLR 2025
♻ ☆ Fine-tuning Whisper on Low-Resource Languages for Real-World Applications
This paper presents a new approach to fine-tuning OpenAI's Whisper model for
low-resource languages by introducing a novel data generation method that
converts sentence-level data into a long-form corpus, using Swiss German as a
case study. Non-sentence-level data, which could improve the performance of
long-form audio, is difficult to obtain and often restricted by copyright laws.
Our method bridges this gap by transforming more accessible sentence-level data
into a format that preserves the model's ability to handle long-form audio and
perform segmentation without requiring non-sentence-level data. Our data
generation process improves performance in several real-world applications and
leads to the development of a new state-of-the-art speech-to-text (STT) model
for Swiss German. We compare our model with a non-fine-tuned Whisper and our
previous state-of-the-art Swiss German STT models, where our new model achieves
higher BLEU scores. Our results also indicate that the proposed method is
adaptable to other low-resource languages, supported by written guidance and
code that allows the creation of fine-tuned Whisper models, which keep
segmentation capabilities and allow the transcription of longer audio files
using only sentence-level data with high quality.
♻ ☆ SEA-LION: Southeast Asian Languages in One Network
Raymond Ng, Thanh Ngan Nguyen, Yuli Huang, Ngee Chia Tai, Wai Yi Leong, Wei Qi Leong, Xianbin Yong, Jian Gang Ngui, Yosephine Susanto, Nicholas Cheng, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Adithya Venkatadri Hulagadri, Kok Wai Teng, Yeo Yeow Tong, Bryan Siow, Wei Yi Teo, Wayne Lau, Choon Meng Tan, Brandon Ong, Zhi Hao Ong, Jann Railey Montalan, Adwin Chan, Sajeban Antonyrex, Ren Lee, Esther Choa, David Ong Tat-Wee, Bing Jie Darius Liu, William Chandra Tjhi, Erik Cambria, Leslie Teo
Recently, Large Language Models (LLMs) have dominated much of the artificial
intelligence scene with their ability to process and generate natural
languages. However, the majority of LLM research and development remains
English-centric, leaving low-resource languages such as those in the Southeast
Asian (SEA) region under-represented. To address this representation gap, we
introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge
multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs
supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese,
Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages
large-scale multilingual continued pre-training with a comprehensive
post-training regime involving multiple stages of instruction fine-tuning,
alignment, and model merging. Evaluation results on multilingual benchmarks
indicate that our models achieve state-of-the-art performance across LLMs
supporting SEA languages. We open-source the models to benefit the wider SEA
community.
comment: We released our model at
https://huggingface.co/collections/aisingapore/sea-lionv3-672589a39cdadd6a5b199581
♻ ☆ Do "New Snow Tablets" Contain Snow? Large Language Models Over-Rely on Names to Identify Ingredients of Chinese Drugs
Traditional Chinese Medicine (TCM) has seen increasing adoption in
healthcare, with specialized Large Language Models (LLMs) emerging to support
clinical applications. A fundamental requirement for these models is accurate
identification of TCM drug ingredients. In this paper, we evaluate how general
and TCM-specialized LLMs perform when identifying ingredients of Chinese drugs.
Our systematic analysis reveals consistent failure patterns: models often
interpret drug names literally, overuse common herbs regardless of relevance,
and exhibit erratic behaviors when faced with unfamiliar formulations. LLMs
also fail to understand the verification task. These findings demonstrate that
current LLMs rely primarily on drug names rather than possessing systematic
pharmacological knowledge. To address these limitations, we propose a Retrieval
Augmented Generation (RAG) approach focused on ingredient names. Experiments
across 220 TCM formulations show our method significantly improves accuracy
from approximately 50% to 82% in ingredient verification tasks. Our work
highlights critical weaknesses in current TCM-specific LLMs and offers a
practical solution for enhancing their clinical reliability.
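The proposed fix amounts to grounding the check in a retrieved ingredient list rather than in whatever the drug's name suggests. A toy sketch with a hypothetical reference table (not the authors' retrieval pipeline):

    # Hypothetical excerpt of a formulation-to-ingredients reference table.
    tcm_reference = {
        "Liu Wei Di Huang Wan": [
            "Rehmannia", "Cornus", "Dioscorea", "Alisma", "Poria", "Moutan",
        ],
    }

    def verify_ingredient(drug_name: str, candidate: str):
        """Answer from retrieved ingredients instead of guessing from the name."""
        ingredients = tcm_reference.get(drug_name)
        if ingredients is None:
            return "unknown formulation"  # decline rather than infer from the name
        return candidate in ingredients

    print(verify_ingredient("Liu Wei Di Huang Wan", "Poria"))  # True
    print(verify_ingredient("New Snow Tablets", "snow"))       # unknown formulation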
♻ ☆ Teaching Transformers Causal Reasoning through Axiomatic Training
Aniket Vashishtha, Abhinav Kumar, Atharva Pandey, Abbavaram Gowtham Reddy, Kabir Ahuja, Vineeth N Balasubramanian, Amit Sharma
For text-based AI systems to interact in the real world, causal reasoning is
an essential skill. Since active interventions are costly, we study to what
extent a system can learn causal reasoning from symbolic demonstrations of
causal axioms. Specifically, we present an axiomatic training method where the
system learns from multiple demonstrations of a causal axiom (or rule), rather
than incorporating the axiom as an inductive bias or inferring it from data
values. A key question is whether the system would learn to generalize from the
axiom demonstrations to more complex scenarios. Our results, based on applying
axiomatic training to learn the transitivity axiom and d-separation rule,
indicate that such generalization is possible. To avoid data contamination
issues, we start with a 67 million parameter transformer model and train it
from scratch. On both tasks, we find that a model trained on linear causal
chains (along with some noisy variations) can generalize well to complex
graphs, including longer causal chains, causal chains with reversed order, and
graphs with branching. To handle diverse text inputs, the same method is
extended to fine-tune language models. Fine-tuning the Llama-3.1 8B model on our
axiomatic data leads to significant gains on causal benchmarks such as
Corr2Cause and CLEAR, in some cases providing state-of-the-art performance
surpassing GPT-4.
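The training signal described above is purely symbolic: many textual demonstrations of an axiom, here transitivity over linear causal chains. A hypothetical generator for such demonstrations (the paper's exact templates, noise injection, and tokenization are not reproduced):

    import random

    def transitivity_demos(num_demos, chain_length=3, seed=0):
        """Generate textual demonstrations of the causal transitivity axiom."""
        rng = random.Random(seed)
        demos = []
        for _ in range(num_demos):
            nodes = rng.sample("ABCDEFGHJKMNPQRSTUVWXYZ", chain_length)
            premise = " ".join(f"{a} causes {b}." for a, b in zip(nodes, nodes[1:]))
            if rng.random() < 0.5:  # forward query: entailed by transitivity
                question, answer = f"Does {nodes[0]} cause {nodes[-1]}?", "Yes"
            else:                   # reversed query: not entailed in an acyclic chain
                question, answer = f"Does {nodes[-1]} cause {nodes[0]}?", "No"
            demos.append(f"{premise} {question} {answer}")
        return demos

    for demo in transitivity_demos(3):
        print(demo)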
♻ ☆ ELTEX: A Framework for Domain-Driven Synthetic Data Generation
We introduce Efficient LLM Token Extraction (ELTEX), a framework addressing
the critical challenge of LLM domain specialization by systematically
extracting and integrating domain indicators throughout synthetic data
generation. Unlike approaches relying on implicit knowledge transfer, ELTEX
explicitly leverages domain signals to maintain specialized knowledge
integrity. In our cybersecurity case study, ELTEX-enhanced data enables a
fine-tuned Gemma-2B model to achieve performance competitive with GPT-4o on
blockchain cyberattack classification while reducing computational
requirements. Our Google Sheets implementation makes ELTEX accessible to
non-technical users. Our contributions include: (1) the ELTEX framework; (2)
Google Sheets Add-on implementation; (3) empirical validation showing how ELTEX
bridges performance gaps between small and large models; and (4) a synthetic
dataset of 11,448 texts for blockchain cyberattack detection.
♻ ☆ Preference-based Learning with Retrieval Augmented Generation for Conversational Question Answering WWW 2025
Conversational Question Answering (ConvQA) involves multiple subtasks, i) to
understand incomplete questions in their context, ii) to retrieve relevant
information, and iii) to generate answers. This work presents PRAISE, a
pipeline-based approach for ConvQA that trains LLM adapters for each of the
three subtasks. As labeled training data for individual subtasks is unavailable
in practice, PRAISE learns from its own generations using the final answering
performance as feedback signal without human intervention and treats
intermediate information, like relevant evidence, as weakly labeled data. We
apply Direct Preference Optimization by contrasting successful and unsuccessful
samples for each subtask. In our experiments, we show the effectiveness of this
training paradigm: PRAISE shows improvements per subtask and achieves new
state-of-the-art performance on a popular ConvQA benchmark, with a 15.5
percentage point increase in precision over baselines.
comment: WWW 2025 Short Paper, 5 pages
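The self-supervised part of PRAISE, contrasting a subtask's successful and unsuccessful generations using only final-answer correctness as the feedback signal, can be sketched as preference-pair construction for DPO (hypothetical data layout, not the authors' pipeline code):

    def build_preference_pairs(samples):
        """Turn self-generated subtask outputs into DPO-style preference pairs.

        Each sample has a "prompt", an "output", and "answer_correct": whether
        the downstream final answer built on this output turned out to be right.
        """
        by_prompt = {}
        for sample in samples:
            by_prompt.setdefault(sample["prompt"], []).append(sample)

        pairs = []
        for prompt, group in by_prompt.items():
            chosen = [g["output"] for g in group if g["answer_correct"]]
            rejected = [g["output"] for g in group if not g["answer_correct"]]
            pairs += [
                {"prompt": prompt, "chosen": c, "rejected": r}
                for c in chosen for r in rejected
            ]
        return pairs

    samples = [
        {"prompt": "rewrite: and when was it founded?",
         "output": "When was the university founded?", "answer_correct": True},
        {"prompt": "rewrite: and when was it founded?",
         "output": "When founded?", "answer_correct": False},
    ]
    print(build_preference_pairs(samples))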
♻ ☆ ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model
Wuyang Lan, Wenzheng Wang, Changwei Ji, Guoxing Yang, Yongbo Zhang, Xiaohong Liu, Song Wu, Guangyu Wang
Recent advances in reasoning with large language models (LLMs) have shown
remarkable reasoning capabilities in domains such as mathematics and coding,
yet their application to clinical diagnosis remains underexplored. Here, we
introduce ClinicalGPT-R1, a reasoning-enhanced generalist large language model
for disease diagnosis. Trained on a dataset of 20,000 real-world clinical
records, ClinicalGPT-R1 leverages diverse training strategies to enhance
diagnostic reasoning. To benchmark performance, we curated MedBench-Hard, a
challenging dataset spanning seven major medical specialties and representative
diseases. Experimental results demonstrate that ClinicalGPT-R1 outperforms
GPT-4o in Chinese diagnostic tasks and achieves comparable performance to GPT-4
in English settings. This comparative study effectively validates the superior
performance of ClinicalGPT-R1 in disease diagnosis tasks. Resources are
available at https://github.com/medfound/medfound.
comment: 8 pages, 6 figures
♻ ☆ Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts
Linear Sequence Modeling (LSM) like linear attention, state space models and
linear RNNs, and Mixture-of-Experts (MoE) have recently emerged as significant
architectural improvements. In this paper, we introduce Linear-MoE, a
production-level system for modeling and training large-scale models that
integrate LSM with MoE. Linear-MoE leverages the advantages of both LSM modules
for linear-complexity sequence modeling and MoE layers for sparse activation,
aiming to offer high performance with efficient training. The Linear-MoE system
comprises: 1) a Modeling subsystem, which provides a unified framework supporting
all instances of LSM, and 2) a Training subsystem, which facilitates efficient
training by incorporating various advanced parallelism technologies,
particularly Sequence Parallelism designed for Linear-MoE models. Additionally,
we explore hybrid models that combine Linear-MoE layers with standard
Transformer-MoE layers, together with Sequence Parallelism, to further enhance model
flexibility and performance. Evaluations on two model series, A0.3B-2B and
A1B-7B, demonstrate Linear-MoE achieves efficiency gains while maintaining
competitive performance on various benchmarks, showcasing its potential as a
next-generation foundational model architecture. Code:
https://github.com/OpenSparseLLMs/Linear-MoE.
comment: Technical report, 17 pages
♻ ☆ Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization
Large language models (LLMs) have shown great progress in responding to user
questions, allowing for a multitude of diverse applications. Yet, the quality
of LLM outputs heavily depends on the prompt design, where a good prompt might
enable the LLM to answer a very challenging question correctly. Therefore,
recent works have developed many strategies for improving the prompt, including
both manual crafting and in-domain optimization. However, their efficacy in
unrestricted scenarios remains questionable, as the former depends on human
design for specific questions and the latter usually generalizes poorly to
unseen scenarios. To address these problems, we give LLMs the freedom to design
the best prompts themselves. Specifically, we include a hierarchy
of LLMs, first constructing a prompt with precise instructions and accurate
wording in a hierarchical manner, and then using this prompt to generate the
final answer to the user query. We term this pipeline Hierarchical Multi-Agent
Workflow, or HMAW. In contrast with prior works, HMAW imposes no human
restriction and requires no training, and is completely task-agnostic while
capable of adjusting to the nuances of the underlying task. Through both
quantitative and qualitative experiments across multiple benchmarks, we verify
that despite its simplicity, the proposed approach can create detailed and
suitable prompts, further boosting the performance of current LLMs.
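A minimal sketch of a hierarchical prompt-construction pipeline in the spirit of HMAW, assuming a generic text-completion callable llm; the level roles and wording here are illustrative assumptions, not the paper's exact prompts:

    def hmaw_answer(user_query, llm):
        # llm(prompt) -> str is any text-completion callable (placeholder).
        # Upper level: draft high-level guidance on how to prompt the next level.
        guidance = llm("Write concise instructions for answering this question "
                       "well:\n" + user_query)
        # Middle level: turn the guidance into a precise, task-specific prompt.
        prompt = llm("Following these instructions:\n" + guidance +
                     "\nwrite a detailed prompt for answering:\n" + user_query)
        # Bottom level: produce the final answer with the constructed prompt.
        return llm(prompt + "\n\nQuestion: " + user_query)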
♻ ☆ CARE: Aligning Language Models for Regional Cultural Awareness
Existing language models (LMs) often exhibit a Western-centric bias and
struggle to represent diverse cultural knowledge. Previous attempts to address
this rely on synthetic data and express cultural knowledge only in English. In
this work, we study whether a small amount of human-written, multilingual
cultural preference data can improve LMs across various model families and
sizes. We first introduce CARE, a multilingual resource of 24.1k responses with
human preferences on 2,580 questions about Chinese and Arab cultures, all
carefully annotated by native speakers and offering more balanced coverage.
Using CARE, we demonstrate that cultural alignment improves existing LMs beyond
generic resources without compromising general capabilities. Moreover, we
evaluate the cultural awareness of LMs, native speakers, and retrieved web
content when queried in different languages. Our experiments reveal regional
disparities among LMs, which may also be reflected in the documentation gap:
native speakers often take everyday cultural commonsense and social norms for
granted, while non-natives are more likely to actively seek out and document
them. CARE is publicly available at https://github.com/Guochry/CARE (we plan to
add Japanese data in the near future).
comment: 24 pages
♻ ☆ FairPy: A Toolkit for Evaluation of Prediction Biases and their Mitigation in Large Language Models
Recent studies have demonstrated that large pretrained language models (LLMs)
such as BERT and GPT-2 exhibit biases in token prediction, often inherited from
the data distributions present in their training corpora. In response, a number
of mathematical frameworks have been proposed to quantify, identify, and
mitigate the likelihood of biased token predictions. In this paper, we
present a comprehensive survey of such techniques tailored towards widely used
LLMs such as BERT, GPT-2, etc. We additionally introduce Fairpy, a modular and
extensible toolkit that provides plug-and-play interfaces for integrating these
mathematical tools, enabling users to evaluate both pretrained and custom
language models. Fairpy supports the implementation of existing debiasing
algorithms. The toolkit is open-source and publicly available at:
\href{https://github.com/HrishikeshVish/Fairpy}{https://github.com/HrishikeshVish/Fairpy}
♻ ☆ TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights ICLR 2025
Aiwei Liu, Haoping Bai, Zhiyun Lu, Yanchao Sun, Xiang Kong, Simon Wang, Jiulong Shan, Albin Madappally Jose, Xiaojiang Liu, Lijie Wen, Philip S. Yu, Meng Cao
Direct Preference Optimization (DPO) has been widely adopted for preference
alignment of Large Language Models (LLMs) due to its simplicity and
effectiveness. However, DPO is derived as a bandit problem in which the whole
response is treated as a single arm, ignoring the importance differences
between tokens, which may affect optimization efficiency and make it difficult
to achieve optimal results. In this work, we propose that the optimal data for
DPO has equal expected rewards for each token in winning and losing responses,
as there is no difference in token importance. However, since the optimal
dataset is unavailable in practice, we propose using the original dataset for
importance sampling to achieve unbiased optimization. Accordingly, we propose a
token-level importance sampling DPO objective named TIS-DPO that assigns
importance weights to each token based on its reward. Inspired by previous
works, we estimate the token importance weights using the difference in
prediction probabilities from a pair of contrastive LLMs. We explore three
methods to construct these contrastive LLMs: (1) guiding the original LLM with
contrastive prompts, (2) training two separate LLMs using winning and losing
responses, and (3) performing forward and reverse DPO training with winning and
losing responses. Experiments show that TIS-DPO significantly outperforms
various baseline methods on harmlessness and helpfulness alignment and
summarization tasks. We also visualize the estimated weights, demonstrating
their ability to identify key token positions.
comment: Published in ICLR 2025, code in https://github.com/exlaw/TIS-DPO
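A minimal sketch of token-level importance weighting in the spirit of TIS-DPO, assuming per-token log-probabilities from a pair of contrastive models are already available; the exponential form and clipping range below are illustrative assumptions, not the paper's exact estimator:

    import math

    def token_weights(logp_pos, logp_neg, clip=(0.5, 2.0)):
        # Estimate per-token importance from the log-probability gap between a
        # pair of contrastive models (e.g., one biased toward winning responses,
        # one toward losing responses).
        weights = [math.exp(p - n) for p, n in zip(logp_pos, logp_neg)]
        lo, hi = clip
        return [min(max(w, lo), hi) for w in weights]

    def weighted_seq_logp(token_logps, weights):
        # Importance-weighted sequence log-likelihood used inside the DPO margin.
        return sum(w * lp for w, lp in zip(weights, token_logps))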
♻ ☆ LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
Hao Sha, Yao Mu, Yuxuan Jiang, Li Chen, Chenfeng Xu, Ping Luo, Shengbo Eben Li, Masayoshi Tomizuka, Wei Zhan, Mingyu Ding
Existing learning-based autonomous driving (AD) systems face challenges in
comprehending high-level information, generalizing to rare events, and
providing interpretability. To address these problems, this work employs Large
Language Models (LLMs) as a decision-making component for complex AD scenarios
that require human commonsense understanding. We devise cognitive pathways to
enable comprehensive reasoning with LLMs, and develop algorithms for
translating LLM decisions into actionable driving commands. Through this
approach, LLM decisions are seamlessly integrated with low-level controllers by
guided parameter matrix adaptation. Extensive experiments demonstrate that our
proposed method not only consistently surpasses baseline approaches in
single-vehicle tasks, but also helps handle complex driving behaviors, even
multi-vehicle coordination, thanks to the commonsense reasoning capabilities of
LLMs. This paper presents an initial step toward leveraging LLMs as effective
decision-makers for intricate AD scenarios in terms of safety, efficiency,
generalizability, and interoperability. We aspire for it to serve as
inspiration for future research in this field. Project page:
https://sites.google.com/view/llm-mpc
♻ ☆ System-1.x: Learning to Balance Fast and Slow Planning with Language Models ICLR 2025
Swarnadeep Saha, Archiki Prasad, Justin Chih-Yao Chen, Peter Hase, Elias Stengel-Eskin, Mohit Bansal
Language models can be used to solve long-horizon planning problems in two
distinct modes: a fast 'System-1' mode, directly generating plans without any
explicit search or backtracking, and a slow 'System-2' mode, planning
step-by-step by explicitly searching over possible actions. While System-2 is
typically more effective, it is also more computationally expensive, making it
infeasible for long plans or large action spaces. Moreover, isolated System-1
or 2 ignores the user's end goals, failing to provide ways to control the
model's behavior. To this end, we propose the System-1.x Planner, a
controllable planning framework with LLMs that is capable of generating hybrid
plans and balancing between the two planning modes based on the difficulty of
the problem at hand. System-1.x consists of (i) a controller, (ii) a System-1
Planner, and (iii) a System-2 Planner. Based on a user-specified hybridization
factor (x) governing the mixture between System-1 and 2, the controller
decomposes a problem into sub-goals, and classifies them as easy or hard to be
solved by either System-1 or 2, respectively. We fine-tune all three components
on top of a single base LLM, requiring only search traces as supervision.
Experiments with two diverse planning tasks -- Maze Navigation and Blocksworld
-- show that our System-1.x Planner outperforms a System-1 Planner, a System-2
Planner trained to approximate A* search, and also a symbolic planner (A*). We
demonstrate the following key properties of our planner: (1) controllability:
increasing the hybridization factor (e.g., System-1.75 vs 1.5) performs more
search, improving performance, (2) flexibility: by building a neuro-symbolic
variant with a neural System-1 and a symbolic System-2, we can use existing
symbolic methods, and (3) generalizability: by being able to learn from
different search algorithms, our method is robust to the choice of search
algorithm.
comment: ICLR 2025 (Camera-Ready)
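A minimal routing sketch of the controller/planner split described above, with all components as placeholder callables (in the paper, all three components are fine-tuned from a single base LLM on search traces):

    def system_1x_plan(problem, controller, sys1, sys2, x=0.5):
        # controller(problem, x) -> list of (sub_goal, is_hard) pairs; the share
        # of sub-goals marked hard is steered by the hybridization factor x.
        # sys1(sub_goal) -> plan fragment generated directly (fast, no search).
        # sys2(sub_goal) -> plan fragment found via explicit step-by-step search.
        plan = []
        for sub_goal, is_hard in controller(problem, x):
            solver = sys2 if is_hard else sys1
            plan.extend(solver(sub_goal))
        return plan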
♻ ☆ LLM$\times$MapReduce-V2: Entropy-Driven Convolutional Test-Time Scaling for Generating Long-Form Articles from Extremely Long Resources
Haoyu Wang, Yujia Fu, Zhu Zhang, Shuo Wang, Zirui Ren, Xiaorong Wang, Zhili Li, Chaoqun He, Bo An, Zhiyuan Liu, Maosong Sun
Long-form generation is crucial for a wide range of practical applications,
typically categorized into short-to-long and long-to-long generation. While
short-to-long generation has received considerable attention, generating long
texts from extremely long resources remains relatively underexplored. The
primary challenge in long-to-long generation lies in effectively integrating
and analyzing relevant information from extensive inputs, which remains
difficult for current large language models (LLMs). In this paper, we propose
LLM$\times$MapReduce-V2, a novel test-time scaling strategy designed to enhance
the ability of LLMs to process extremely long inputs. Drawing inspiration from
convolutional neural networks, which iteratively integrate local features into
higher-level global representations, LLM$\times$MapReduce-V2 utilizes stacked
convolutional scaling layers to progressively expand the understanding of input
materials. Both quantitative and qualitative experimental results demonstrate
that our approach substantially enhances the ability of LLMs to process long
inputs and generate coherent, informative long-form articles, outperforming
several representative baselines. Both LLM$\times$MapReduce-V2 and SurveyEval
are publicly available at https://github.com/thunlp/LLMxMapReduce .
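A minimal sketch of the convolution-style aggregation idea, assuming a placeholder summarize LLM call that fuses a local window of materials; the window and stride values are illustrative, not the paper's settings:

    def convolutional_aggregate(chunks, summarize, window=3, stride=2):
        # chunks: non-empty list of text pieces split from the long resources.
        # summarize(list_of_texts) -> str is a placeholder LLM call that fuses a
        # local window of materials into one higher-level note.
        layer = chunks
        while len(layer) > 1:
            layer = [summarize(layer[i:i + window])
                     for i in range(0, len(layer), stride)]
        return layer[0]  # a single global digest used to draft the article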
♻ ☆ IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities AAAI 2025
In the field of multimodal large language models (MLLMs), common methods
typically involve unfreezing the language model during training to foster
profound visual understanding. However, the fine-tuning of such models with
vision-language data often leads to a diminution of their natural language
processing (NLP) capabilities. To avoid this performance degradation, a
straightforward solution is to freeze the language model while developing
multimodal competencies. Unfortunately, previous works have not attained
satisfactory outcomes. Building on the strategy of freezing the language model,
we conduct thorough structural exploration and introduce the Inner-Adaptor
Architecture (IAA). Specifically, the architecture incorporates multiple
multimodal adaptors at varying depths within the large language model to
facilitate direct interaction with the inherently text-oriented transformer
layers, thereby enabling the frozen language model to acquire multimodal
capabilities. Unlike previous approaches of freezing language models that
require large-scale aligned data, our proposed architecture is able to achieve
superior performance on small-scale datasets. We conduct extensive experiments
to improve the general multimodal capabilities and visual grounding abilities
of the MLLM. Our approach remarkably outperforms previous state-of-the-art
methods across various vision-language benchmarks without sacrificing
performance on NLP tasks. Code and models are available at
https://github.com/360CVGroup/Inner-Adaptor-Architecture.
comment: AAAI 2025
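A minimal sketch of the inner-adaptor idea, assuming generic layer and adaptor callables (the actual adaptor design, insertion depths, and grounding heads are in the linked repository):

    def forward_with_inner_adaptors(hidden, frozen_layers, adaptors, visual_tokens):
        # frozen_layers: the unchanged text-only transformer blocks (no updates).
        # adaptors: {layer_index: adaptor_fn}; each adaptor_fn(hidden, visual_tokens)
        # injects multimodal information at that depth and is the only trained part.
        for i, layer in enumerate(frozen_layers):
            hidden = layer(hidden)
            if i in adaptors:
                hidden = adaptors[i](hidden, visual_tokens)
        return hidden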
♻ ☆ Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
Structured pruning is a promising approach to create smaller, faster LLMs.
However, existing methods typically rely on backward passes, which can inflate
memory requirements and compute costs. In this work we introduce Bonsai, a
gradient-free structured pruning method that eliminates the need for
backpropagation, significantly reducing memory requirements and compute costs
while achieving state-of-the-art pruning performance. Bonsai uses
forward-pass-only perturbative pruning to enable efficient compression of large
models on a broader range of hardware configurations. Unlike existing
structured pruning approaches, Bonsai not only achieves better compression with
fewer resources, but also produces models that are twice as fast as those
generated by semi-structured pruning. As a concrete demonstration, we use
Bonsai to prune an 8B LLaMA-3 model to 50% sparsity on a single A6000 GPU -- a
task infeasible with backprop-based methods, which require 2-3x memory. Our
results show that removing backprop as a requirement not only enables pruning
larger models on constrained hardware but can also lead to state-of-the-art
efficiency and performance.
comment: 19 pages, 6 figures, 16 tables
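A simplified sketch of forward-pass-only perturbative importance scoring in the spirit of Bonsai; the sampling scheme and averaging below are illustrative stand-ins for the paper's actual estimator, and model_loss is a placeholder:

    import random

    def perturbative_importance(model_loss, units, num_trials=100, drop_frac=0.1, seed=0):
        # model_loss(dropped: set) -> float runs a forward pass with the given
        # structural units (heads/channels) masked out; no backprop is needed.
        rng = random.Random(seed)
        scores = {u: [] for u in units}
        base = model_loss(set())
        for _ in range(num_trials):
            dropped = set(rng.sample(units, max(1, int(drop_frac * len(units)))))
            delta = model_loss(dropped) - base
            for u in dropped:
                scores[u].append(delta)  # attribute the loss increase to dropped units
        # Units whose removal barely hurts the loss are candidates for pruning.
        return {u: (sum(v) / len(v) if v else 0.0) for u, v in scores.items()}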
♻ ☆ Large language models could be rote learners
Multiple-choice question (MCQ) benchmarks are widely used for evaluating
Large Language Models (LLMs), yet their reliability is undermined by benchmark
contamination. In this study, we reframe contamination as an inherent aspect of
learning and seek to disentangle genuine capability acquisition from
superficial memorization in LLM evaluation. First, by analyzing model
performance under different memorization conditions, we uncover a
counterintuitive trend: LLMs perform worse on memorized MCQs than on
non-memorized ones, indicating the coexistence of two distinct learning
phenomena, i.e., rote memorization and genuine capability learning. To
disentangle them, we propose TrinEval, a novel evaluation framework that
reformulates MCQs into an alternative trinity format, reducing memorization
while preserving knowledge assessment. Experiments validate TrinEval's
effectiveness in reformulation, and its evaluation reveals that common LLMs may
memorize 20.5% of knowledge points by rote (on average in MMLU).
comment: Work in Progress
♻ ☆ DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
Large Language Models (LLMs) equipped with web search capabilities have
demonstrated impressive potential for deep research tasks. However, current
approaches predominantly rely on either manually engineered prompts (prompt
engineering-based) with brittle performance or reinforcement learning within
controlled Retrieval-Augmented Generation (RAG) environments (RAG-based) that
fail to capture the complexities of real-world interaction. In this paper, we
introduce DeepResearcher, the first comprehensive framework for end-to-end
training of LLM-based deep research agents through scaling reinforcement
learning (RL) in real-world environments with authentic web search
interactions. Unlike RAG-based approaches that assume all necessary information
exists within a fixed corpus, our method trains agents to navigate the noisy,
unstructured, and dynamic nature of the open web. We implement a specialized
multi-agent architecture where browsing agents extract relevant information
from various webpage structures and overcome significant technical
challenges. Extensive experiments on open-domain research tasks demonstrate
that DeepResearcher achieves substantial improvements of up to 28.9 points over
prompt engineering-based baselines and up to 7.2 points over RAG-based RL
agents. Our qualitative analysis reveals emergent cognitive behaviors from
end-to-end RL training, including the ability to formulate plans,
cross-validate information from multiple sources, engage in self-reflection to
redirect research, and maintain honesty when unable to find definitive answers.
Our results highlight that end-to-end training in real-world web environments
is not merely an implementation detail but a fundamental requirement for
developing robust research capabilities aligned with real-world applications.
We release DeepResearcher at https://github.com/GAIR-NLP/DeepResearcher.
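A minimal sketch of the kind of tool-using research loop described above, with the policy, web search API, and browsing sub-agents as placeholder callables; this is illustrative only, not the released training or inference code:

    def research_step(question, policy, search, browse, max_rounds=5):
        # policy(context) -> action dict; search(query) -> result list;
        # browse(url) -> extracted text. All three are placeholders for the
        # trained LLM policy, a real search API, and the browsing sub-agents.
        context = [{"role": "user", "content": question}]
        for _ in range(max_rounds):
            action = policy(context)
            if action["type"] == "search":
                context.append({"role": "tool", "content": search(action["query"])})
            elif action["type"] == "browse":
                context.append({"role": "tool", "content": browse(action["url"])})
            else:  # "answer"
                return action["content"]
        return None  # the agent may also report it found no definitive answer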
♻ ☆ AFlow: Automating Agentic Workflow Generation
Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, Chenglin Wu
Large language models (LLMs) have demonstrated remarkable potential in
solving complex tasks across diverse domains, typically by employing agentic
workflows that follow detailed instructions and operational sequences. However,
constructing these workflows requires significant human effort, limiting
scalability and generalizability. Recent research has sought to automate the
generation and optimization of these workflows, but existing methods still rely
on initial manual setup and fall short of achieving fully automated and
effective workflow generation. To address this challenge, we reformulate
workflow optimization as a search problem over code-represented workflows,
where LLM-invoking nodes are connected by edges. We introduce AFlow, an
automated framework that efficiently explores this space using Monte Carlo Tree
Search, iteratively refining workflows through code modification,
tree-structured experience, and execution feedback. Empirical evaluations
across six benchmark datasets demonstrate AFlow's efficacy, yielding a 5.7%
average improvement over state-of-the-art baselines. Furthermore, AFlow enables
smaller models to outperform GPT-4o on specific tasks at 4.55% of its inference
cost in dollars. The code is available at
https://github.com/FoundationAgents/AFlow.
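A simplified search loop over code-represented workflows in the spirit of AFlow, with LLM-driven expansion and execution feedback as placeholder callables; this greedy variant is a stand-in for the actual Monte Carlo Tree Search, not the released implementation:

    import random

    def aflow_search(initial_workflow, expand, evaluate, iterations=20, seed=0):
        # expand(workflow) -> candidate workflows produced by code modifications.
        # evaluate(workflow) -> float score from executing it on validation tasks.
        rng = random.Random(seed)
        best = initial_workflow
        best_score = evaluate(best)
        frontier = [(best_score, best)]
        for _ in range(iterations):
            _, node = max(frontier, key=lambda t: t[0])  # most promising so far
            children = expand(node)
            if not children:
                break
            child = rng.choice(children)      # explore one proposed edit
            child_score = evaluate(child)     # execution feedback
            frontier.append((child_score, child))
            if child_score > best_score:
                best, best_score = child, child_score
        return best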
♻ ☆ Are Generative AI Agents Effective Personalized Financial Advisors? SIGIR 2025
Large language model-based agents are becoming increasingly popular as a
low-cost mechanism to provide personalized, conversational advice, and have
demonstrated impressive capabilities in relatively simple scenarios, such as
movie recommendations. But how do these agents perform in complex high-stakes
domains, where domain expertise is essential and mistakes carry substantial
risk? This paper investigates the effectiveness of LLM-advisors in the finance
domain, focusing on three distinct challenges: (1) eliciting user preferences
when users themselves may be unsure of their needs, (2) providing personalized
guidance for diverse investment preferences, and (3) leveraging advisor
personality to build relationships and foster trust. Via a lab-based user study
with 64 participants, we show that LLM-advisors often match human advisor
performance when eliciting preferences, although they can struggle to resolve
conflicting user needs. When providing personalized advice, the LLM was able to
positively influence user behavior, but demonstrated clear failure modes. Our
results show that accurate preference elicitation is key; otherwise, the
LLM-advisor has little impact, or can even direct the investor toward
unsuitable assets. More worryingly, users appear insensitive to the quality of
the advice given, or worse, the two can have an inverse relationship. Indeed,
users reported a preference for, and increased satisfaction and emotional trust
with, LLMs adopting an extroverted persona, even though those agents provided
worse advice.
comment: Accepted for presentation at SIGIR 2025