Computation and Language
☆ dMel: Speech Tokenization made Simple
Large language models have revolutionized natural language processing by
leveraging self-supervised pretraining on vast textual data. Inspired by this
success, researchers have investigated complicated speech tokenization methods
to discretize continuous speech signals so that language modeling techniques
can be applied to speech data. However, existing approaches either model
semantic tokens, potentially losing acoustic information, or model acoustic
tokens, risking the loss of semantic information. Having multiple token types
also complicates the architecture and requires additional pretraining. Here we
show that discretizing mel-filterbank channels into discrete intensity bins
produces a simple representation (dMel) that performs better than other
existing speech tokenization methods. Using a transformer decoder-only
architecture for speech-text modeling, we comprehensively evaluate different
speech tokenization methods on speech recognition (ASR) and speech synthesis
(TTS). Our results demonstrate the effectiveness of dMel in achieving high
performance on both tasks within a unified framework, paving the way for
efficient and effective joint modeling of speech and text.
comment: under review
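The core idea, discretizing each mel-filterbank channel into a fixed number of intensity bins, can be sketched in a few lines. The bin count and dynamic range below are illustrative placeholders, not the paper's actual settings:

```python
def dmel_tokenize(log_mel, n_bins=16, lo=-7.0, hi=2.0):
    """log_mel: list of frames, each a list of log-mel channel energies.
    Clips each value to [lo, hi] and uniformly quantizes it, yielding one
    integer intensity code per channel per frame (assumed ranges, not the
    paper's)."""
    width = (hi - lo) / n_bins
    tokens = []
    for frame in log_mel:
        codes = []
        for v in frame:
            v = min(max(v, lo), hi)                # clip to dynamic range
            codes.append(min(int((v - lo) / width), n_bins - 1))
        tokens.append(codes)
    return tokens

def dmel_detokenize(tokens, n_bins=16, lo=-7.0, hi=2.0):
    """Inverse map: each code goes back to the center of its intensity bin."""
    width = (hi - lo) / n_bins
    return [[lo + (c + 0.5) * width for c in frame] for frame in tokens]
```

Because the quantization is per-channel and uniform, the round-trip error is bounded by half a bin width, which is what makes such a representation lossy-but-simple compared to learned codecs.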
☆ J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling
Spoken dialogue plays a crucial role in human-AI interactions, necessitating
dialogue-oriented spoken language models (SLMs). To develop versatile SLMs,
large-scale and diverse speech datasets are essential. Additionally, to ensure
high-quality speech generation, the data must be spontaneous, like in-the-wild data,
and must be acoustically clean with noise removed. Despite the critical need,
no open-source corpus meeting all these criteria has been available. This study
addresses this gap by constructing and releasing a large-scale spoken dialogue
corpus, named Japanese Corpus for Human-AI Talks (J-CHAT), which is publicly
accessible. Furthermore, this paper presents a language-independent method for
corpus construction and describes experiments on dialogue generation using SLMs
trained on J-CHAT. Experimental results indicate that the data collected from
multiple domains by our method improve the naturalness and meaningfulness of
dialogue generation.
comment: 8 pages, 6 figures
☆ Perceptions of Linguistic Uncertainty by Language Models and Humans
Uncertainty expressions such as ``probably'' or ``highly unlikely'' are
pervasive in human language. While prior work has established that there is
population-level agreement in terms of how humans interpret these expressions,
there has been little inquiry into the abilities of language models to
interpret such expressions. In this paper, we investigate how language models
map linguistic expressions of uncertainty to numerical responses. Our approach
assesses whether language models can employ theory of mind in this setting:
understanding the uncertainty of another agent about a particular statement,
independently of the model's own certainty about that statement. We evaluate
both humans and 10 popular language models on a task created to assess these
abilities. Unexpectedly, we find that 8 out of 10 models are able to map
uncertainty expressions to probabilistic responses in a human-like manner.
However, we observe systematically different behavior depending on whether a
statement is actually true or false. This sensitivity indicates that language
models are substantially more susceptible to bias based on their prior
knowledge (as compared to humans). These findings raise important questions and
have broad implications for human-AI alignment and AI-AI communication.
comment: In submission
☆ FSboard: Over 3 million characters of ASL fingerspelling collected via smartphones
Manfred Georg, Garrett Tanzer, Saad Hassan, Maximus Shengelia, Esha Uboweja, Sam Sepah, Sean Forbes, Thad Starner
Progress in machine understanding of sign languages has been slow and
hampered by limited data. In this paper, we present FSboard, an American Sign
Language fingerspelling dataset situated in a mobile text entry use case,
collected from 147 paid and consenting Deaf signers using Pixel 4A selfie
cameras in a variety of environments. Fingerspelling recognition is an
incomplete solution that is only one small part of sign language translation,
but it could provide some immediate benefit to Deaf/Hard of Hearing signers as
more broadly capable technology develops. At >3 million characters in length
and >250 hours in duration, FSboard is the largest fingerspelling recognition
dataset to date by a factor of >10x. As a simple baseline, we finetune 30 Hz
MediaPipe Holistic landmark inputs into ByT5-Small and achieve 11.1% Character
Error Rate (CER) on a test set with unique phrases and signers. This quality
degrades gracefully when decreasing frame rate and excluding face/body
landmarks: plausible optimizations to help models run on device in real time.
comment: Access FSboard at https://www.kaggle.com/datasets/googleai/fsboard
☆ Extracting Structured Insights from Financial News: An Augmented LLM Driven Approach
Financial news plays a crucial role in decision-making processes across the
financial sector, yet the efficient processing of this information into a
structured format remains challenging. This paper presents a novel approach to
financial news processing that leverages Large Language Models (LLMs) to
overcome limitations that previously prevented the extraction of structured
data from unstructured financial news. We introduce a system that extracts
relevant company tickers from raw news article content, performs sentiment
analysis at the company level, and generates summaries, all without relying on
pre-structured data feeds. Our methodology combines the generative capabilities
of LLMs and recent prompting techniques with a robust validation framework
that uses a tailored string similarity approach. Evaluation on a dataset of
5530 financial news articles demonstrates the effectiveness of our approach:
for 90% of articles, no tickers are missed compared with current data
providers, and 22% of articles contain additional relevant tickers. In addition
to this paper, the methodology has been implemented at scale with the resulting
processed data made available through a live API endpoint, which is updated in
real-time with the latest news. To the best of our knowledge, we are the first
data provider to offer granular, per-company sentiment analysis from news
articles, enhancing the depth of information available to market participants.
We also release the evaluation dataset of 5530 processed articles as a static
file, which we hope will facilitate further research leveraging financial news.
comment: 7 pages, 6 figures
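The validation step described above can be approximated with stdlib string matching; the threshold, data shapes, and helper names here are hypothetical illustrations, not the paper's implementation:

```python
from difflib import SequenceMatcher

def validate_tickers(extracted, known):
    """Keep an LLM-extracted (company_name, ticker) pair only if the company
    name closely matches the official listing for that ticker.

    extracted: list of (company_name, ticker) pairs produced by the LLM.
    known: dict mapping ticker -> official company name.
    The 0.8 similarity threshold is an assumed value for illustration."""
    validated = []
    for company, ticker in extracted:
        official = known.get(ticker)
        if official is None:
            continue  # ticker not in the reference universe; discard
        sim = SequenceMatcher(None, company.lower(), official.lower()).ratio()
        if sim >= 0.8:
            validated.append((company, ticker))
    return validated
```

A check like this guards against the main LLM failure mode in extraction pipelines: confidently emitting a plausible but wrong ticker for a company mentioned in the text.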
☆ Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning
Kaiwen Wang, Rahul Kidambi, Ryan Sullivan, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi, Yunxuan Li, Raghav Gupta, Avinava Dubey, Alexandre Ramé, Johan Ferret, Geoffrey Cideron, Le Hou, Hongkun Yu, Amr Ahmed, Aranyak Mehta, Léonard Hussenot, Olivier Bachem, Edouard Leurent
Reward-based finetuning is crucial for aligning language policies with
intended behaviors (e.g., creativity and safety). A key challenge here is to
develop steerable language models that trade-off multiple (conflicting)
objectives in a flexible and efficient manner. This paper presents Conditioned
Language Policy (CLP), a general framework for finetuning language models on
multiple objectives. Building on techniques from multi-task training and
parameter-efficient finetuning, CLP can learn steerable models that effectively
trade-off conflicting objectives at inference time. Notably, this does not
require training or maintaining multiple models to achieve different trade-offs
between the objectives. Through an extensive set of experiments and ablations,
we show that the CLP framework learns steerable models that outperform and
Pareto-dominate the current state-of-the-art approaches for multi-objective
finetuning.
comment: 40 pages
☆ LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Large multimodal models (LMMs) are processing increasingly longer and richer
inputs. Despite this progress, few public benchmarks are available to measure
such development. To mitigate this gap, we introduce LongVideoBench, a
question-answering benchmark that features video-language interleaved inputs up
to an hour long. Our benchmark includes 3,763 varying-length web-collected
videos with their subtitles across diverse themes, designed to comprehensively
evaluate LMMs on long-term multimodal understanding. To achieve this, we
interpret the primary challenge as accurately retrieving and reasoning over
detailed multimodal information from long inputs. As such, we formulate a novel
video question-answering task termed referring reasoning. Specifically, each
question contains a referring query that references related video contexts,
called the referred context. The model is then required to reason over
relevant video details from the referred context. Following the paradigm of
referring reasoning, we curate 6,678 human-annotated multiple-choice questions
in 17 fine-grained categories, establishing one of the most comprehensive
benchmarks for long-form video understanding. Evaluations suggest that the
LongVideoBench presents significant challenges even for the most advanced
proprietary models (e.g. GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo), while their
open-source counterparts show an even larger performance gap. In addition, our
results indicate that models' performance on the benchmark improves only when
they are capable of processing more frames, positioning LongVideoBench as a
valuable benchmark for evaluating future-generation long-context LMMs.
comment: 29 pages
☆ OMoS-QA: A Dataset for Cross-Lingual Extractive Question Answering in a German Migration Context
When immigrating to a new country, it is easy to feel overwhelmed by the need
to obtain information on financial support, housing, schooling, language
courses, and other issues. If relocation is rushed or even forced, the
necessity for high-quality answers to such questions is all the more urgent.
Official immigration counselors are usually overbooked, and online systems
could guide newcomers to the requested information or a suitable counseling
service.
To this end, we present OMoS-QA, a dataset of German and English questions
paired with relevant trustworthy documents and manually annotated answers,
specifically tailored to this scenario. Questions are automatically generated
with an open-source large language model (LLM) and answer sentences are
selected by crowd workers with high agreement. With our data, we conduct a
comparison of 5 pretrained LLMs on the task of extractive question answering
(QA) in German and English. Across all models and both languages, we find high
precision and low-to-mid recall in selecting answer sentences, which is a
favorable trade-off to avoid misleading users. This performance even holds up
when the question language does not match the document language. When it comes
to identifying unanswerable questions given a context, there are larger
differences between the two languages.
comment: Accepted to KONVENS 2024
☆ DStruct2Design: Data and Benchmarks for Data Structure Driven Generative Floor Plan Design
Text conditioned generative models for images have yielded impressive
results. Text conditioned floorplan generation, as a special type of raster
image generation task, has also received particular attention. However, there
are many use cases in floorplan generation where numerical properties of the
generated result are more important than the aesthetics. For instance, one
might want to specify sizes for certain rooms in a floorplan and compare the
generated floorplan with given specifications. Current approaches, datasets and
commonly used evaluations do not support these kinds of constraints. As such,
an attractive strategy is to generate an intermediate data structure that
contains numerical properties of a floorplan which can be used to generate the
final floorplan image. To explore this setting we (1) construct a new dataset
for this data-structure to data-structure formulation of floorplan generation
using two popular image-based floorplan datasets, RPLAN and ProcTHOR-10k, and
provide the tools to convert further procedurally generated ProcTHOR floorplan
data into our format. (2) We explore the task of floorplan generation given a
partial or complete set of constraints and we design a series of metrics and
benchmarks to enable evaluating how well samples generated from models respect
the constraints. (3) We create multiple baselines by finetuning a large
language model (LLM), Llama3, and demonstrate the feasibility of using
floorplan data structure conditioned LLMs for the problem of floorplan
generation respecting numerical constraints. We hope that our new datasets and
benchmarks will encourage further research on different ways to improve the
performance of LLMs and other generative modelling techniques for generating
designs where quantitative constraints are only partially specified, but must
be respected.
☆ Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability
Large language models (LLMs) have emerged as powerful tools for many AI
problems and exhibit remarkable in-context learning (ICL) capabilities.
Compositional ability, solving unseen complex tasks that combine two or more
simple tasks, is an essential reasoning ability for Artificial General
Intelligence. Despite LLMs' tremendous success, how they approach composite
tasks, especially those not encountered during the pretraining phase, remains
an open question that is not well understood. In this study, we delve into the ICL
capabilities of LLMs on composite tasks, with only simple tasks as in-context
examples. We develop a test suite of composite tasks that include linguistic
and logical challenges and perform empirical studies across different LLM
families. We observe that models exhibit divergent behaviors: (1) For simpler
composite tasks that apply distinct mapping mechanisms to different input
segments, the models demonstrate decent compositional ability, while scaling up
the model enhances this ability; (2) for more complex composite tasks
involving multi-step reasoning, where each step represents one task, models
typically underperform, and scaling up generally provides no improvement. We
offer theoretical analysis in a simplified setting, explaining that models
exhibit compositional capability when the task handles different input parts
separately. We believe our work sheds new light on the capabilities of LLMs in
solving composite tasks regarding the nature of the tasks and model scale. Our
dataset and code are available at
{\url{https://github.com/OliverXUZY/LLM_Compose}}.
☆ AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
Language agents, built on top of language models (LMs), are systems that can
interact with complex environments, such as the open web. In this work, we
examine whether such agents can perform realistic and time-consuming tasks on
the web, e.g., monitoring real-estate markets or locating relevant nearby
businesses. We introduce AssistantBench, a challenging new benchmark consisting
of 214 realistic tasks that can be automatically evaluated, covering different
scenarios and domains. We find that AssistantBench exposes the limitations of
current systems, including language models and retrieval-augmented language
models, as no model reaches an accuracy of more than 25 points. While
closed-book LMs perform well, they exhibit low precision since they tend to
hallucinate facts. State-of-the-art web agents reach a score of near zero.
Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly
outperforms previous agents, and an ensemble of SPA and closed-book models
reaches the best overall performance. Moreover, we analyze failures of current
systems and highlight that web navigation remains a major challenge.
☆ Supporting the Digital Autonomy of Elders Through LLM Assistance
The internet offers tremendous access to services, social connections, and
needed products. However, to those without sufficient experience, engaging with
businesses and friends across the internet can be daunting due to the ever
present danger of scammers and thieves, to say nothing of the myriad of
potential computer viruses. Like a forest rich with both edible and poisonous
plants, those familiar with the norms inhabit it safely with ease while
newcomers need a guide. However, reliance on a human digital guide can be
taxing and often impractical. We propose and pilot a simple but unexplored
idea: could an LLM provide the necessary support to help the elderly who are
separated by the digital divide safely achieve digital autonomy?
☆ Counter Turing Test ($CT^2$): Investigating AI-Generated Text Detection for Hindi -- Ranking LLMs based on Hindi AI Detectability Index ($ADI_{hi}$)
The widespread adoption of large language models (LLMs) and awareness around
multilingual LLMs have raised concerns regarding the potential risks and
repercussions linked to the misapplication of AI-generated text, necessitating
increased vigilance. While these models are primarily trained for English,
their extensive training on vast datasets covering almost the entire web
equips them with the capability to perform well in numerous other languages.
AI-Generated Text Detection (AGTD) has emerged as a topic that has already
received immediate attention in research, with some initial methods having been
proposed, soon followed by the emergence of techniques to bypass detection. In
this paper, we report our investigation of AGTD for an Indic language, Hindi.
Our major contributions are fourfold: i) we examine 26 LLMs to evaluate
their proficiency in generating Hindi text, ii) we introduce the AI-generated
news article in Hindi ($AG_{hi}$) dataset, iii) we evaluate the effectiveness of
five recently proposed AGTD techniques (ConDA, J-Guard, RADAR, RAIDAR and
Intrinsic Dimension Estimation) for detecting AI-generated Hindi text, and iv) we
propose the Hindi AI Detectability Index ($ADI_{hi}$), which offers a spectrum for
understanding the evolving landscape of eloquence of AI-generated text in Hindi.
We will make the codes and datasets available to encourage further research.
☆ Psychometric Alignment: Capturing Human Knowledge Distributions via Language Models
Joy He-Yueya, Wanjing Anya Ma, Kanishk Gandhi, Benjamin W. Domingue, Emma Brunskill, Noah D. Goodman
Language models (LMs) are increasingly used to simulate human-like responses
in scenarios where accurately mimicking a population's behavior can guide
decision-making, such as in developing educational materials and designing
public policies. The objective of these simulations is for LMs to capture the
variations in human responses, rather than merely providing the expected
correct answers. Prior work has shown that LMs often generate unrealistically
accurate responses, but there are no established metrics to quantify how
closely the knowledge distribution of LMs aligns with that of humans. To
address this, we introduce "psychometric alignment," a metric that measures the
extent to which LMs reflect human knowledge distribution. Assessing this
alignment involves collecting responses from both LMs and humans to the same
set of test items and using Item Response Theory to analyze the differences in
item functioning between the groups. We demonstrate that our metric can capture
important variations in populations that traditional metrics, like differences
in accuracy, fail to capture. We apply this metric to assess existing LMs for
their alignment with human knowledge distributions across three real-world
domains. We find significant misalignment between LMs and human populations,
though using persona-based prompts can improve alignment. Interestingly,
smaller LMs tend to achieve greater psychometric alignment than larger LMs.
Further, training LMs on human response data from the target distribution
enhances their psychometric alignment on unseen test items, but the
effectiveness of such training varies across domains.
comment: Code and data: https://github.com/joyheyueya/psychometric-alignment
☆ RadioRAG: Factual Large Language Models for Enhanced Diagnostics in Radiology Using Dynamic Retrieval Augmented Generation
Soroosh Tayebi Arasteh, Mahshad Lotfinia, Keno Bressem, Robert Siepmann, Dyke Ferber, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn
Large language models (LLMs) have advanced the field of artificial
intelligence (AI) in medicine. However, LLMs often generate outdated or
inaccurate information based on static training datasets. Retrieval augmented
generation (RAG) mitigates this by integrating outside data sources. While
previous RAG systems used pre-assembled, fixed databases with limited
flexibility, we have developed Radiology RAG (RadioRAG) as an end-to-end
framework that retrieves data from authoritative radiologic online sources in
real-time. RadioRAG is evaluated using a dedicated radiologic
question-and-answer dataset (RadioQA). We evaluate the diagnostic accuracy of
various LLMs when answering radiology-specific questions with and without
access to additional online information via RAG. Using 80 questions from RSNA
Case Collection across radiologic subspecialties and 24 additional
expert-curated questions, for which the correct gold-standard answers were
available, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B
and 70B]) were prompted with and without RadioRAG. RadioRAG retrieved
context-specific information from www.radiopaedia.org in real-time and
incorporated it into its replies. RadioRAG consistently improved diagnostic
accuracy across all LLMs, with relative improvements ranging from 2% to 54%. It
matched or exceeded question answering without RAG across radiologic
subspecialties, particularly in breast imaging and emergency radiology.
However, the degree of improvement varied among models; GPT-3.5-turbo and
Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2
showed no improvement, highlighting variability in the method's effectiveness. LLMs
benefit when provided access to domain-specific data beyond their training
data. For radiology, RadioRAG establishes a robust framework that substantially
improves diagnostic accuracy and factuality in radiological question answering.
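A minimal retrieve-then-prompt loop in the spirit of such a pipeline can be sketched as follows; the toy lexical retriever and prompt template are illustrative stand-ins, not RadioRAG's actual real-time retrieval from radiopaedia.org:

```python
def retrieve(query, corpus, k=2):
    """Toy lexical retriever: rank documents by word overlap with the query.
    A stand-in for a real-time domain-specific retriever."""
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def build_prompt(question, corpus):
    """Assemble a RAG-style prompt: retrieved context first, then the question.
    The template wording is an assumption for illustration."""
    context = "\n".join(retrieve(question, corpus))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

The final prompt would then be sent to any of the evaluated LLMs; the point of the design is that the model answers with fresh, domain-specific context instead of relying solely on static training data.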
☆ Can GPT-4 learn to analyze moves in research article abstracts?
One of the most powerful and enduring ideas in written discourse analysis is
that genres can be described in terms of the moves which structure a writer's
purpose. Considerable research has sought to identify these distinct
communicative acts, but analyses have been beset by problems of subjectivity,
reliability and the time-consuming need for multiple coders to confirm
analyses. In this paper we employ the affordances of GPT-4 to automate the
annotation process by using natural language prompts. Focusing on abstracts
from articles in four applied linguistics journals, we devise prompts which
enable the model to identify moves effectively. The annotated outputs of these
prompts were evaluated by two assessors with a third addressing disagreements.
The results show that an 8-shot prompt was more effective than a two-shot one,
confirming that the inclusion of examples illustrating areas of variability can
enhance GPT-4's ability to recognize multiple moves in a single sentence and
reduce bias related to textual position. We suggest that GPT-4 offers
considerable potential in automating this annotation process, when human actors
with domain specific linguistic expertise inform the prompting process.
☆ StylusAI: Stylistic Adaptation for Robust German Handwritten Text Generation ICDAR 2024
In this study, we introduce StylusAI, a novel architecture leveraging
diffusion models in the domain of handwriting style generation. StylusAI is
specifically designed to adapt and integrate the stylistic nuances of one
language's handwriting into another, particularly focusing on blending English
handwriting styles into the context of the German writing system. This approach
enables the generation of German text in English handwriting styles and English
text in German handwriting styles, enriching machine-generated handwriting
diversity while ensuring that the generated text remains legible across both
languages. To support the development and evaluation of StylusAI, we present
the \lq{Deutscher Handschriften-Datensatz}\rq~(DHSD), a comprehensive dataset
encompassing 37 distinct handwriting styles within the German language. This
dataset provides a fundamental resource for training and benchmarking in the
realm of handwritten text generation. Our results demonstrate that StylusAI not
only introduces a new method for style adaptation in handwritten text
generation but also surpasses existing models in generating handwriting samples
that improve both text quality and stylistic fidelity, evidenced by its
performance on the IAM database and our newly proposed DHSD. Thus, StylusAI
represents a significant advancement in the field of handwriting style
generation, offering promising avenues for future research and applications in
cross-linguistic style adaptation for languages with similar scripts.
comment: Accepted in ICDAR 2024
☆ Unsupervised Robust Cross-Lingual Entity Alignment via Joint Modeling of Entity and Relation Texts
Cross-lingual entity alignment (EA) enables the integration of multiple
knowledge graphs (KGs) across different languages, providing users with
seamless access to diverse and comprehensive knowledge. Existing methods, mostly
supervised, face challenges in obtaining labeled entity pairs. To address this,
recent studies have shifted towards self-supervised and unsupervised
frameworks. Despite their effectiveness, these approaches have limitations: (1)
they mainly focus on entity features, neglecting the semantic information of
relations, (2) they assume isomorphism between source and target graphs,
leading to noise and reduced alignment accuracy, and (3) they are susceptible
to noise in the textual features, especially when encountering inconsistent
translations or Out-Of-Vocabulary (OOV) problems.
In this paper, we propose ERAlign, an unsupervised and robust cross-lingual
EA framework that jointly performs Entity-level and Relation-level Alignment
using semantic textual features of relations and entities. Its refinement
process iteratively enhances results by fusing entity-level and relation-level
alignments based on neighbor triple matching. An additional verification
process examines the entities' neighbor triples as linearized text. This
\textit{Align-and-Verify} pipeline rigorously assesses alignment results,
achieving near-perfect alignment even in the presence of noisy textual features
of entities. Our extensive experiments demonstrate that the robustness and
general applicability of ERAlign improve the accuracy and effectiveness of EA
tasks, contributing significantly to knowledge-oriented applications.
☆ An Empirical Study of Retrieval Augmented Generation with Chain-of-Thought
Since the launch of ChatGPT at the end of 2022, generative dialogue models
represented by ChatGPT have quickly become essential tools in daily life. As
user expectations increase, enhancing the capability of generative dialogue
models to solve complex problems has become a focal point of current research.
This paper delves into the effectiveness of the RAFT (Retrieval Augmented
Fine-Tuning) method in improving the performance of generative dialogue models.
RAFT combines chain-of-thought with supervised fine-tuning (SFT) and
retrieval augmented generation (RAG), significantly enhancing the model's
information extraction and logical reasoning abilities. We evaluated the RAFT
method across multiple datasets and analysed its performance in various
reasoning tasks, including long-form QA and short-form QA tasks, tasks in both
Chinese and English, and supportive and comparison reasoning tasks. Notably, it
addresses the gaps in previous research regarding long-form QA tasks and
Chinese datasets. Moreover, we also evaluate the benefit of the
chain-of-thought (CoT) in the RAFT method. This work offers valuable insights
for studies focused on enhancing the performance of generative dialogue models.
comment: 5 pages, 4 figures
☆ SETTP: Style Extraction and Tunable Inference via Dual-level Transferable Prompt Learning
Text style transfer, an important research direction in natural language
processing, aims to adapt the text to various preferences but often faces
challenges with limited resources. In this work, we introduce a novel method
termed Style Extraction and Tunable Inference via Dual-level Transferable
Prompt Learning (SETTP) for effective style transfer in low-resource scenarios.
First, SETTP learns source style-level prompts containing fundamental style
characteristics from high-resource style transfer. During training, the source
style-level prompts are transferred through an attention module to derive a
target style-level prompt for beneficial knowledge provision in low-resource
style transfer. Additionally, we propose instance-level prompts obtained by
clustering the target resources based on the semantic content to reduce
semantic bias. We also propose an automated evaluation approach of style
similarity based on alignment with human evaluations using ChatGPT-4. Our
experiments across three resourceful styles show that SETTP requires only
1/20th of the data volume to achieve performance comparable to state-of-the-art
methods. In tasks involving scarce data like writing style and role style,
SETTP outperforms previous methods by 16.24\%.
☆ Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
Large language models (LLMs) can often be made to behave in undesirable ways
that they are explicitly fine-tuned not to. For example, the LLM red-teaming
literature has produced a wide variety of `jailbreaking' techniques to elicit
harmful text from models that were fine-tuned to be harmless. Recent work on
red-teaming, model editing, and interpretability suggests that this challenge
stems from how (adversarial) fine-tuning largely serves to suppress rather than
remove undesirable capabilities from LLMs. Prior work has introduced latent
adversarial training (LAT) as a way to improve robustness to broad classes of
failures. These prior works have considered untargeted latent space attacks
where the adversary perturbs latent activations to maximize loss on examples of
desirable behavior. Untargeted LAT can provide a generic type of robustness but
does not leverage information about specific failure modes. Here, we experiment
with targeted LAT where the adversary seeks to minimize loss on a specific
competing task. We find that it can augment a wide variety of state-of-the-art
methods. First, we use targeted LAT to improve robustness to jailbreaks,
outperforming a strong R2D2 baseline with orders of magnitude less compute.
Second, we use it to more effectively remove backdoors with no knowledge of the
trigger. Finally, we use it to more effectively unlearn knowledge for specific
undesirable tasks in a way that is also more robust to re-learning. Overall,
our results suggest that targeted LAT can be an effective tool for defending
against harmful behaviors from LLMs.
☆ Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models
The inference demand for LLMs has skyrocketed in recent months, and serving
models with low latencies remains challenging due to the quadratic input length
complexity of the attention layers. In this work, we investigate the effect of
dropping MLP and attention layers at inference time on the performance of
Llama-v2 models. We find that dropping deeper attention layers only marginally
decreases performance but leads to the best speedups alongside dropping entire
layers. For example, removing 33\% of attention layers in a 13B Llama2 model
results in a 1.8\% drop in average performance over the OpenLLM benchmark. We
also observe that skipping layers other than the later ones degrades
performance further as more layers are skipped, except when the skipped layers
are attention layers.
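The layer-dropping idea can be made concrete with a toy sketch (ours, for intuition only, not the paper's code; the `attn_scale`/`mlp_scale` residual stand-ins are invented placeholders for real sublayers):

```python
# Toy sketch of sublayer dropping (not the paper's code). Each "block" adds
# an attention-like and an MLP-like residual contribution; dropping the
# attention sublayer just omits its term from the residual stream.

def make_block(attn_scale, mlp_scale):
    """A stand-in transformer block with two residual sublayers."""
    def block(x, use_attn=True, use_mlp=True):
        if use_attn:
            x = x + attn_scale * x  # placeholder for self-attention
        if use_mlp:
            x = x + mlp_scale * x   # placeholder for the MLP
        return x
    return block

def run_model(x, blocks, drop_attn_from=None):
    """Run blocks in order, dropping attention in layers >= drop_attn_from."""
    for i, block in enumerate(blocks):
        keep_attn = drop_attn_from is None or i < drop_attn_from
        x = block(x, use_attn=keep_attn)
    return x

blocks = [make_block(0.1, 0.2) for _ in range(6)]
full = run_model(1.0, blocks)                      # all sublayers active
pruned = run_model(1.0, blocks, drop_attn_from=4)  # no attention in last 2
```

Because each dropped sublayer simply removes one residual term, the pruned forward pass needs no retraining to remain well defined; the empirical question the paper studies is how much quality this costs.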
☆ Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners
Large Language Models (LLMs) showcase remarkable performance and robust
deductive capabilities, yet their expansive size complicates deployment and
raises environmental concerns due to substantial resource consumption. The
recent development of a quantization technique known as Learnable
Singular-value Increment (LSI) has addressed some of these quantization
challenges. Leveraging insights from LSI and our extensive research, we have
developed innovative methods that enhance the performance of quantized LLMs,
particularly in low-bit settings. Our methods consistently deliver
state-of-the-art results across various quantization scenarios and offer deep
theoretical insights into the quantization process, elucidating the potential
of quantized models for widespread application.
comment: Efficient Quantization Methods for LLMs
☆ Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models
We formalize the problem of prompt compression for large language models
(LLMs) and present a framework to unify token-level prompt compression methods
which create hard prompts for black-box models. We derive the distortion-rate
function for this setup as a linear program, and provide an efficient algorithm
to compute this fundamental limit via the dual of the linear program. Using the
distortion-rate function as the baseline, we study the performance of existing
compression schemes on a synthetic dataset consisting of prompts generated from
a Markov chain, natural language queries, and their respective answers. Our
empirical analysis demonstrates the criticality of query-aware prompt
compression, where the compressor has knowledge of the downstream task/query
for the black-box LLM. We show that there is a large gap between the
performance of current prompt compression methods and the optimal strategy, and
propose a query-aware, variable-rate adaptation of a prior work to close the
gap. We extend our experiments to a small natural language dataset to further
confirm our findings on our synthetic dataset.
comment: 40 pages, 15 figures. Under review
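The distortion-rate baseline can be illustrated with a brute-force toy version (our sketch, not the paper's linear-program derivation; the `importance` weights and the dropped-mass distortion measure are invented for illustration):

```python
# Brute-force toy distortion-rate curve for token-level prompt compression
# (not the paper's LP formulation). Distortion here is the total invented
# importance of the tokens a compressor drops.
from itertools import combinations

prompt = ["please", "summarize", "the", "quarterly", "sales", "report"]
importance = {"please": 0.1, "summarize": 1.0, "the": 0.05,
              "quarterly": 0.8, "sales": 0.9, "report": 0.7}

def distortion(kept):
    """Toy distortion: importance mass of the dropped tokens."""
    return sum(importance[t] for t in prompt if t not in kept)

def distortion_rate_curve(prompt):
    """Minimum achievable distortion at each rate r (= tokens kept)."""
    return {r: min(distortion(set(c)) for c in combinations(prompt, r))
            for r in range(len(prompt) + 1)}

curve = distortion_rate_curve(prompt)  # distortion falls as the rate grows
```

The enumeration over subsets is exponential; the point of the paper's LP formulation is precisely to compute this fundamental limit efficiently via the dual program.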
☆ Refining Corpora from a Model Calibration Perspective for Chinese Spelling Correction
Chinese Spelling Correction (CSC) commonly lacks large-scale high-quality
corpora, due to the labor-intensive labeling of spelling errors in real-life
human writing or typing scenarios. Two data augmentation methods are widely
adopted: (1) \textit{Random Replacement} with the guidance of confusion sets
and (2) \textit{OCR/ASR-based Generation} that simulates character misusing.
However, both methods inevitably introduce noisy data (e.g., false spelling
errors), potentially leading to over-correction. By carefully analyzing the two
types of corpora, we find that though the latter achieves more robust
generalization performance, the former yields better-calibrated CSC models. We
then provide a theoretical analysis of this empirical observation, based on
which a corpus refining strategy is proposed. Specifically, OCR/ASR-based data
samples are fed into a well-calibrated CSC model trained on random
replacement-based corpora and then filtered based on prediction confidence. By
learning a simple BERT-based model on the refined OCR/ASR-based corpus, we set
up impressive state-of-the-art performance on three widely-used benchmarks,
while significantly alleviating over-correction (e.g., lowering false positive
predictions).
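The refining step can be sketched in a few lines (ours, not the authors' pipeline; `mock_confidence` stands in for a well-calibrated CSC model's score that a labeled correction reflects a genuine spelling error):

```python
# Sketch of confidence-based corpus refining (not the authors' code): keep
# an OCR/ASR-generated sample only when a calibrated model is confident
# that its labeled correction is a genuine spelling error.

def refine_corpus(samples, confidence_fn, threshold=0.5):
    """Filter samples by the calibrated model's prediction confidence."""
    return [s for s in samples if confidence_fn(s) >= threshold]

corpus = [
    {"src": "天气真号", "tgt": "天气真好", "conf": 0.92},    # plausible error
    {"src": "他去了北京", "tgt": "她去了北京", "conf": 0.21},  # likely noise
]
mock_confidence = lambda s: s["conf"]  # stand-in for the trained model
refined = refine_corpus(corpus, mock_confidence)
```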
☆ Two Stacks Are Better Than One: A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives
Pretrained language models (PLMs) display impressive performances and have
captured the attention of the NLP community. Establishing the best practices in
pretraining has therefore become a major point of focus for much of NLP
research -- especially since the insights developed for monolingual English
models need not carry to more complex multilingual settings. One significant
caveat of
the current state of the art is that different works are rarely comparable:
they often discuss different parameter counts, training data, and evaluation
methodology.
This paper proposes a comparison of multilingual pretraining objectives in a
controlled methodological environment. We ensure that training data and model
architectures are comparable, and discuss the downstream performances across 6
languages that we observe in probing and fine-tuning scenarios. We make two key
observations: (1) the architecture dictates which pretraining objective is
optimal; (2) multilingual translation is a very effective pre-training
objective under the right conditions. We make our code, data, and model weights
available at \texttt{\url{https://github.com/Helsinki-NLP/lm-vs-mt}}.
☆ Text-to-Battery Recipe: A language modeling-based protocol for automatic battery recipe extraction and retrieval
Recent studies have increasingly applied natural language processing (NLP) to
automatically extract experimental research data from the extensive battery
materials literature. Despite the complex process involved in battery
manufacturing -- from material synthesis to cell assembly -- there has been no
comprehensive study systematically organizing this information. In response, we
propose a language modeling-based protocol, Text-to-Battery Recipe (T2BR), for
the automatic extraction of end-to-end battery recipes, validated using a case
study on batteries containing LiFePO4 cathode material. We report machine
learning-based paper filtering models, screening 2,174 relevant papers from the
keyword-based search results, and unsupervised topic models to identify 2,876
paragraphs related to cathode synthesis and 2,958 paragraphs related to cell
assembly. Then, focusing on the two topics, two deep learning-based named
entity recognition models are developed to extract a total of 30 entities --
including precursors, active materials, and synthesis methods -- achieving F1
scores of 88.18% and 94.61%. The accurate extraction of entities enables the
systematic generation of 165 end-to-end recipes of LiFePO4 batteries. Our
protocol and results offer valuable insights into specific trends, such as
associations between precursor materials and synthesis methods, or combinations
between different precursor materials. We anticipate that our findings will
serve as a foundational knowledge base for facilitating battery-recipe
information retrieval. The proposed protocol will significantly accelerate the
review of battery material literature and catalyze innovations in battery
design and development.
☆ Developing a Reliable, General-Purpose Hallucination Detection and Mitigation Service: Insights and Lessons Learned
Song Wang, Xun Wang, Jie Mei, Yujia Xie, Sean Muarray, Zhang Li, Lingfeng Wu, Si-Qing Chen, Wayne Xiong
Hallucination, a phenomenon where large language models (LLMs) produce output
that is factually incorrect or unrelated to the input, is a major challenge for
LLM applications that require accuracy and dependability. In this paper, we
introduce a reliable and high-speed production system aimed at detecting and
rectifying the hallucination issue within LLMs. Our system encompasses named
entity recognition (NER), natural language inference (NLI), span-based
detection (SBD), and an intricate decision tree-based process to reliably
detect a wide range of hallucinations in LLM responses. Furthermore, our team
has crafted a rewriting mechanism that maintains an optimal mix of precision,
response time, and cost-effectiveness. We detail the core elements of our
framework and underscore the paramount challenges tied to response time,
availability, and performance metrics, which are crucial for real-world
deployment of these technologies. Our extensive evaluation, utilizing offline
data and live production traffic, confirms the efficacy of our proposed
framework and service.
☆ Empirical Capacity Model for Self-Attention Neural Networks
Large pretrained self-attention neural networks, or transformers, have been
very successful in various tasks recently. The performance of a model on a
given task depends on its ability to memorize and generalize the training data.
Large transformer models, which may have billions of parameters, in theory have
a huge capacity to memorize content. However, the current algorithms for the
optimization fall short of the theoretical capacity, and the capacity is also
highly dependent on the content. In this paper, we focus on the memory capacity
of these models obtained using common training algorithms and synthetic
training data. Based on the results, we derive an empirical capacity model
(ECM) for a generic transformer. The ECM can be used to design task-specific
transformer models with an optimal number of parameters in cases where the
target memorization capability of the task can be defined.
comment: Submitted to BNAIC'24, 14 pages + refs
☆ LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models
We introduce LLaST, a framework for building high-performance Large Language
Model based speech-to-text translation systems. We address the limitations of
end-to-end speech translation (E2E ST) models by exploring model architecture
design and optimization techniques tailored for LLMs. Our approach includes
LLM-based speech translation architecture design, ASR-augmented training,
multilingual data augmentation, and dual-LoRA optimization. Our approach
demonstrates superior performance on the CoVoST-2 benchmark and showcases
exceptional scaling capabilities powered by LLMs. We believe this effective
method will serve as a strong baseline for speech translation and provide
insights for future improvements of the LLM-based speech translation framework.
We release the data, code, and models at https://github.com/openaudiolab/LLaST.
☆ Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Mengru Wang, Yunzhi Yao, Ziwen Xu, Shuofei Qiao, Shumin Deng, Peng Wang, Xiang Chen, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial
for advancing towards trustworthy AGI. This paper reviews knowledge mechanism
analysis from a novel taxonomy including knowledge utilization and evolution.
Knowledge utilization delves into the mechanism of memorization, comprehension
and application, and creation. Knowledge evolution focuses on the dynamic
progression of knowledge within individual and group LLMs. Moreover, we discuss
what knowledge LLMs have learned, the reasons for the fragility of parametric
knowledge, and the potential dark knowledge (hypothesis) that will be
challenging to address. We hope this work can help understand knowledge in LLMs
and provide insights for future research.
comment: Ongoing work (v1); 34 pages, 5 figures
☆ Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models
With the development of large language models (LLMs) like ChatGPT, both their
vast applications and potential vulnerabilities have come to the forefront.
While developers have integrated multiple safety mechanisms to mitigate their
misuse, a risk remains, particularly when models encounter adversarial inputs.
This study unveils an attack mechanism that capitalizes on human conversation
strategies to extract harmful information from LLMs. We delineate three pivotal
strategies: (i) decomposing malicious questions into seemingly innocent
sub-questions; (ii) rewriting overtly malicious questions into more covert,
benign-sounding ones; (iii) enhancing the harmfulness of responses by prompting
models for illustrative examples. Unlike conventional methods that target
explicit malicious responses, our approach delves deeper into the nature of the
information provided in responses. Through our experiments conducted on
GPT-3.5-turbo, GPT-4, and Llama2, our method has demonstrated a marked efficacy
compared to conventional attack methods. In summary, this work introduces a
novel attack method that outperforms previous approaches, raising an important
question: How to discern whether the ultimate intent in a dialogue is
malicious?
☆ ALLaM: Large Language Models for Arabic and English
M Saiful Bari, Yazeed Alnumay, Norah A. Alzahrani, Nouf M. Alotaibi, Hisham A. Alyahya, Sultan AlRashed, Faisal A. Mirza, Shaykhah Z. Alsubaie, Hassan A. Alahmed, Ghadah Alabduljabbar, Raghad Alkhathran, Yousef Almushayqih, Raneem Alnajim, Salman Alsubaihi, Maryam Al Mansour, Majed Alrubaian, Ali Alammari, Zaki Alawami, Abdulmohsen Al-Thubaity, Ahmed Abdelali, Jeril Kuriakose, Abdalghani Abujabal, Nora Al-Twairesh, Areeb Alowisheq, Haidar Khan
We present ALLaM: Arabic Large Language Model, a series of large language
models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is
carefully trained considering the values of language alignment and knowledge
transfer at scale. Our autoregressive decoder-only architecture models
demonstrate how second-language acquisition via vocabulary expansion and
pretraining on a mixture of Arabic and English text can steer a model towards a
new language (Arabic) without any catastrophic forgetting in the original
language (English). Furthermore, we highlight the effectiveness of using
parallel/translated data to aid the process of knowledge alignment between
languages. Finally, we show that extensive alignment with human preferences can
significantly enhance the performance of a language model compared to models of
a larger scale with lower quality alignment. ALLaM achieves state-of-the-art
performance in various Arabic benchmarks, including MMLU Arabic, ACVA, and
Arabic Exams. Our aligned models improve both in Arabic and English from their
base aligned models.
☆ The Development of a Comprehensive Spanish Dictionary for Phonetic and Lexical Tagging in Socio-phonetic Research (ESPADA) LREC2022
Pronunciation dictionaries are an important component in the process of
speech forced alignment. The accuracy of these dictionaries has a strong effect
on the aligned speech data since they help the mapping between orthographic
transcriptions and acoustic signals. In this paper, I present the creation of a
comprehensive pronunciation dictionary in Spanish (ESPADA) that can be used in
most of the dialect variants of Spanish data. Current dictionaries focus on
specific regional variants, but with the flexible nature of our tool, it can be
readily applied to capture the most common phonetic differences across major
dialectal variants. We propose improvements to current pronunciation
dictionaries as well as mapping other relevant annotations such as
morphological and lexical information. In terms of size, it is currently the
most complete dictionary with more than 628,000 entries, representing words
from 16 countries. All entries come with their corresponding pronunciations,
morphological and lexical tagging, and other relevant information for phonetic
analysis: stress patterns, phonotactics, IPA transcriptions, and more. This
aims to equip socio-phonetic researchers with a complete open-source tool that
enhances dialectal research within socio-phonetic frameworks in the Spanish
language.
comment: Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI)
within LREC2022
☆ ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts
Social Media platforms have offered invaluable opportunities for linguistic
research. The availability of up-to-date data, coming from any part of the
world and from natural contexts, has allowed researchers to study
language in real time. One of the fields that has made great use of social
media platforms is Corpus Linguistics. There is currently a wide range of
projects which have been able to successfully create corpora from social media.
In this paper, we present the development and deployment of a linguistic corpus
from Twitter posts in English, coming from 26 news agencies and 27 individuals.
The main goal was to create a fully annotated English corpus for linguistic
analysis. We include information on morphology and syntax, as well as NLP
features such as tokenization, lemmas, and n-grams. The information is
presented through a range of powerful visualisations for users to explore
linguistic patterns in the corpus. With this tool, we aim to contribute to the
area of language technologies applied to linguistic research.
comment: Conference on Language Technologies & Digital Humanities Ljubljana,
2022
☆ A Network Analysis Approach to Conlang Research Literature
The field of conlang research has seen significant growth in recent decades.
This has been the product of a wide interest in the use and study of conlangs
for artistic purposes. However, one important question is what is happening
with conlangs in the academic world. This paper aims to provide an overall
understanding of the literature on conlang research. With this we aim to give a
realistic picture of the field in present days. We have implemented a
computational linguistic approach, combining bibliometrics and network analysis
to examine all publications available in the Scopus database. Analysing over
2,300 academic publications from 1927 to 2022, we have found that Esperanto
is by far the most documented conlang. Three main authors have contributed to
this: Garv\'ia R., Fiedler S., and Blanke D. The 1970s and 1980s have been the
decades where the foundations of current research have been built. In terms of
methodologies, language learning and experimental linguistics contribute the
most to the preferred approaches of study in the field. We
present the results and discuss our limitations and future work.
☆ Walking in Others' Shoes: How Perspective-Taking Guides Large Language Models in Reducing Toxicity and Bias
The common toxicity and societal bias in contents generated by large language
models (LLMs) necessitate strategies to reduce harm. Present solutions often
demand white-box access to the model or substantial training, which is
impractical for cutting-edge commercial LLMs. Moreover, prevailing prompting
methods depend on external tool feedback and fail to simultaneously lessen
toxicity and bias. Motivated by social psychology principles, we propose a
novel strategy named \textbf{perspective-taking prompting (\textsc{PeT})} that
inspires LLMs to integrate diverse human perspectives and self-regulate their
responses. This self-correction mechanism can significantly diminish toxicity
(up to $89\%$) and bias (up to $73\%$) in LLMs' responses. Rigorous evaluations
and ablation studies are conducted on two commercial LLMs (ChatGPT and GLM) and
three open-source LLMs, revealing \textsc{PeT}'s superiority in producing less
harmful responses, outperforming five strong baselines.
☆ Dissecting Multiplication in Transformers: Insights into LLMs
Transformer-based large language models have achieved remarkable performance
across various natural language processing tasks. However, they often struggle
with seemingly easy tasks like arithmetic despite their vast capabilities. This
stark disparity raises concerns about their safe and ethical use and hinders
their widespread adoption. In this paper, we focus on a typical arithmetic task,
integer multiplication, to explore and explain the imperfection of transformers
in this domain. We provide comprehensive analysis of a vanilla transformer
trained to perform n-digit integer multiplication. Our observations indicate
that the model decomposes multiplication task into multiple parallel subtasks,
sequentially optimizing each subtask for each digit to complete the final
multiplication. Based on these observations and analyses, we infer that the
reason for transformers' deficiencies in multiplication tasks lies in their
difficulty in calculating successive carryovers and caching intermediate
results, and we confirm this inference through experiments. Guided by these
findings, we propose improvements to enhance transformer performance on
multiplication tasks. These enhancements, validated through rigorous testing
and mathematical modeling, not only enhance the transformer's interpretability
but also improve its performance; e.g., we achieve over 99.9% accuracy on
5-digit integer multiplication with a tiny transformer, outperforming GPT-4. Our
method contributes to the broader fields of model understanding and
interpretability, paving the way for analyzing more complex tasks and
Transformer models. This work underscores the importance of explainable AI,
helping to build trust in large language models and promoting their adoption in
critical applications.
comment: 8 pages, 5 figures
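The carry-chain bookkeeping that the abstract singles out as hard for transformers can be made concrete with a short sketch (ours, for intuition only): digit-wise multiplication with explicit successive carryovers, least-significant digit first.

```python
# Explicit carry-chain bookkeeping in digit-wise multiplication, the
# subtask identified as difficult (our sketch, for intuition only).

def multiply_by_digit(a_digits, b_digit):
    """Multiply a number (least-significant digit first) by a single digit,
    recording each successive carryover as it is produced."""
    out, carry, carries = [], 0, []
    for d in a_digits:
        prod = d * b_digit + carry
        out.append(prod % 10)  # digit written at this position
        carry = prod // 10     # carryover passed to the next position
        carries.append(carry)
    if carry:
        out.append(carry)
    return out, carries

# 987 * 6 = 5922, digits given least-significant first:
digits, carries = multiply_by_digit([7, 8, 9], 6)
```

Each output digit depends on the carry produced one position earlier, which is the sequential dependency a parallel, per-digit decomposition struggles to track.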
☆ UF-HOBI at "Discharge Me!": A Hybrid Solution for Discharge Summary Generation Through Prompt-based Tuning of GatorTronGPT Models ACL 2024
Automatic generation of discharge summaries presents significant challenges
due to the length of clinical documentation, the dispersed nature of patient
information, and the diverse terminology used in healthcare. This paper
presents a hybrid solution for generating discharge summary sections as part of
our participation in the "Discharge Me!" Challenge at the BioNLP 2024 Shared
Task. We developed a two-stage generation method using both extractive and
abstractive techniques, in which we first apply named entity recognition (NER)
to extract key clinical concepts, which are then used as input for a
prompt-tuning-based GatorTronGPT model to generate coherent text for two
important sections including "Brief Hospital Course" and "Discharge
Instructions". Our system was ranked 5th in this challenge, achieving an
overall score of 0.284. The results demonstrate the effectiveness of our hybrid
solution in improving the quality of automated discharge section generation.
comment: BIONLP 2024 and Shared Tasks @ ACL 2024
☆ Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA
Retrieval augmented generation (RAG) enhances the accuracy and reliability of
generative AI models by sourcing factual information from external databases,
which is extensively employed in document-grounded question-answering (QA)
tasks. Off-the-shelf RAG flows are well pretrained on general-purpose
documents, yet they encounter significant challenges when being applied to
knowledge-intensive vertical domains, such as electronic design automation
(EDA). This paper addresses this issue by proposing a customized RAG framework
along with three domain-specific techniques for EDA tool documentation QA,
including a contrastive learning scheme for text embedding model fine-tuning, a
reranker distilled from a proprietary LLM, and a generative LLM fine-tuned with
high-quality domain corpus. Furthermore, we have developed and released a
documentation QA evaluation benchmark, ORD-QA, for OpenROAD, an advanced
RTL-to-GDSII design platform. Experimental results demonstrate that our
proposed RAG flow and techniques have achieved superior performance on ORD-QA
as well as on a commercial tool, compared with the state of the art. The ORD-QA
benchmark and the training dataset for our customized RAG flow are open-source
at https://github.com/lesliepy99/RAG-EDA.
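The retrieve-then-generate shape of such a framework can be sketched minimally (our illustration, not the paper's system; the command names in `DOCS` are invented, and the real framework uses a fine-tuned embedding model, a distilled reranker, and a fine-tuned LLM rather than word overlap and a template):

```python
# Minimal retrieve-then-generate loop (our sketch, not the paper's system).
# Retrieval is bag-of-words overlap and "generation" is a template; the
# command names in DOCS are invented for illustration.

DOCS = [
    "The place_opt command performs placement optimization.",
    "The route_design command runs detailed routing.",
    "Use report_timing to generate a timing report.",
]

def words(text):
    """Lowercased word set, with trailing periods stripped."""
    return set(text.lower().replace(".", "").split())

def retrieve(query, docs, k=1):
    """Rank docs by word overlap with the query and return the top k."""
    return sorted(docs, key=lambda d: -len(words(query) & words(d)))[:k]

def answer(query, docs):
    """Ground the 'generated' answer in the retrieved context."""
    context = " ".join(retrieve(query, docs))
    return f"Based on the documentation: {context}"

response = answer("how do I run detailed routing", DOCS)
```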
☆ MAVEN-Fact: A Large-scale Event Factuality Detection Dataset
The Event Factuality Detection (EFD) task determines the factuality of textual
events, i.e., classifying whether an event is a fact, possibility, or
impossibility, which is essential for faithfully understanding and utilizing
event knowledge. However, due to the lack of high-quality large-scale data,
event factuality detection is under-explored in event understanding research,
which limits the development of the EFD community. To address these issues and
provide faithful event understanding, we introduce MAVEN-Fact, a large-scale
and high-quality EFD dataset based on the MAVEN dataset. MAVEN-Fact includes
factuality annotations of 112,276 events, making it the largest EFD dataset.
Extensive experiments demonstrate that MAVEN-Fact is challenging for both
conventional fine-tuned models and large language models (LLMs). Thanks to the
comprehensive annotations of event arguments and relations in MAVEN, MAVEN-Fact
also supports some further analyses and we find that adopting event arguments
and relations helps in event factuality detection for fine-tuned models but
does not benefit LLMs. Furthermore, we preliminarily study an application case
of event factuality detection and find it helps in mitigating event-related
hallucination in LLMs. Our dataset and codes can be obtained from
\url{https://github.com/lcy2723/MAVEN-FACT}
comment: Under review
☆ LLMExplainer: Large Language Model based Bayesian Inference for Graph Explanation Generation
Recent studies seek to provide Graph Neural Network (GNN) interpretability
via multiple unsupervised learning models. Due to the scarcity of datasets,
current methods easily suffer from learning bias. To solve this problem, we
embed a Large Language Model (LLM) as knowledge into the GNN explanation
network to avoid the learning bias problem. We inject LLM as a Bayesian
Inference (BI) module to mitigate learning bias. The efficacy of the BI module
has been proven both theoretically and experimentally. We conduct experiments
on both synthetic and real-world datasets. The innovation of our work lies in
two parts: 1. We provide a novel view of the possibility of an LLM functioning
as a Bayesian inference module to improve the performance of existing
algorithms; 2.
We are the first to discuss the learning bias issues in the GNN explanation
problem.
comment: Preprint Paper with 13 pages
☆ Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models
Wenbin An, Feng Tian, Jiahao Nie, Wenkai Shi, Haonan Lin, Yan Chen, QianYing Wang, Yaqiang Wu, Guang Dai, Ping Chen
Knowledge-based Visual Question Answering (KVQA) requires both image and
world knowledge to answer questions. Current methods first retrieve knowledge
from the image and external knowledge base with the original complex question,
then generate answers with Large Language Models (LLMs). However, since the
original question contains complex elements that require knowledge from
different sources, acquiring different kinds of knowledge in a coupled manner
may confuse models and hinder them from retrieving precise knowledge.
Furthermore, the ``forward-only'' answering process fails to explicitly capture
the knowledge needs of LLMs, which can further hurt answering quality. To cope
with the above limitations, we propose DKA: Disentangled Knowledge Acquisition
from LLM feedback, a training-free framework that disentangles knowledge
acquisition to avoid confusion and uses LLM's feedback to specify the required
knowledge. Specifically, DKA requires LLMs to specify what knowledge they need
to answer the question and decompose the original complex question into two
simple sub-questions: Image-based sub-question and Knowledge-based
sub-question. Then we use the two sub-questions to retrieve knowledge from the
image and knowledge base, respectively. In this way, two knowledge acquisition
models can focus on the content that corresponds to them and avoid disturbance
of irrelevant elements in the original complex question, which can help to
provide more precise knowledge and better align the knowledge needs of LLMs to
yield correct answers. Experiments on benchmark datasets show that DKA
significantly outperforms SOTA models. To facilitate future research, our data
and code are available at \url{https://github.com/Lackel/DKA}.
comment: Pre-print
☆ Improving Minimum Bayes Risk Decoding with Multi-Prompt
While instruction fine-tuned LLMs are effective text generators, sensitivity
to prompt construction makes performance unstable and sub-optimal in practice.
Relying on a single "best" prompt cannot capture all differing approaches to a
generation problem. Using this observation, we propose multi-prompt decoding,
where many candidate generations are decoded from a prompt bank at
inference-time. To ensemble candidates, we use Minimum Bayes Risk (MBR)
decoding, which selects a final output using a trained value metric. We show
multi-prompt improves MBR across a comprehensive set of conditional generation
tasks, and show this is a result of estimating a more diverse and higher
quality candidate space than that of a single prompt. Further experiments
confirm multi-prompt improves generation across tasks, models and metrics.
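Multi-prompt MBR selection can be sketched in a few lines (our illustration, not the paper's implementation; the unigram-overlap utility stands in for the trained value metric):

```python
# Multi-prompt MBR selection (our illustration, not the paper's code).
# Candidates stand in for generations decoded from different prompts, and a
# unigram-overlap score stands in for the trained value metric.

def overlap(a, b):
    """Toy symmetric utility: Jaccard overlap of word sets."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def mbr_select(candidates):
    """Return the candidate with highest expected utility vs. the others."""
    def expected_utility(c):
        return sum(overlap(c, o) for o in candidates if o is not c)
    return max(candidates, key=expected_utility)

cands = ["the cat sat on the mat",  # as if decoded from prompt 1
         "a cat sat on the mat",    # as if decoded from prompt 2
         "dogs run in the park"]    # as if decoded from prompt 3
best = mbr_select(cands)            # the consensus candidate wins
```

The selection rewards candidates that agree with the rest of the pool, which is why a more diverse, higher-quality candidate space from multiple prompts helps.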
☆ ZZU-NLP at SIGHAN-2024 dimABSA Task: Aspect-Based Sentiment Analysis with Coarse-to-Fine In-context Learning
The DimABSA task requires fine-grained sentiment intensity prediction for
restaurant reviews, including scores for Valence and Arousal dimensions for
each Aspect Term. In this study, we propose a Coarse-to-Fine In-context
Learning (CFICL) method based on the Baichuan2-7B model for the DimABSA task in
the SIGHAN 2024 workshop. Our method improves prediction accuracy through a
two-stage optimization process. In the first stage, we use fixed in-context
examples and prompt templates to enhance the model's sentiment recognition
capability and provide initial predictions for the test data. In the second
stage, we encode the Opinion field using BERT and select the most similar
training data as new in-context examples based on similarity. These examples
include the Opinion field and its scores, as well as related opinion words and
their average scores. By filtering for sentiment polarity, we ensure that the
examples are consistent with the test data. Our method significantly improves
prediction accuracy and consistency by effectively utilizing training data and
optimizing in-context examples, as validated by experimental results.
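The second-stage example selection can be sketched as nearest-neighbor retrieval over embeddings (our illustration, not the authors' code; the hand-made 2-D vectors stand in for BERT encodings of the Opinion field):

```python
# Similarity-based in-context example selection (our illustration). The
# hand-made 2-D vectors stand in for BERT embeddings of the Opinion field.
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_examples(query_vec, pool, k=2):
    """Return the k pool items whose embedding is closest to the query."""
    return sorted(pool, key=lambda ex: -cosine(query_vec, ex["vec"]))[:k]

pool = [
    {"opinion": "great service", "vec": [0.9, 0.1]},
    {"opinion": "terrible food", "vec": [0.1, 0.9]},
    {"opinion": "friendly staff", "vec": [0.8, 0.2]},
]
examples = select_examples([1.0, 0.0], pool, k=2)  # nearest two opinions
```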
☆ Deep Learning for Economists
Deep learning provides powerful methods to impute structured information from
large-scale, unstructured text and image datasets. For example, economists
might wish to detect the presence of economic activity in satellite images, or
to measure the topics or entities mentioned in social media, the congressional
record, or firm filings. This review introduces deep neural networks, covering
methods such as classifiers, regression models, generative AI, and embedding
models. Applications include classification, document digitization, record
linkage, and methods for data exploration in massive scale text and image
corpora. When suitable methods are used, deep learning models can be cheap to
tune and can scale affordably to problems involving millions or billions of
data points. The review is accompanied by a companion website, EconDL, with
user-friendly demo notebooks, software resources, and a knowledge base that
provides technical details and additional applications.
♻ ☆ Foundation Models for Autonomous Robots in Unstructured Environments
Automating activities through robots in unstructured environments, such as
construction sites, has been a long-standing desire. However, the high degree
of unpredictable events in these settings has resulted in far less adoption
compared to more structured settings, such as manufacturing, where robots can
be hard-coded or trained on narrowly defined datasets. Recently, pretrained
foundation models, such as Large Language Models (LLMs), have demonstrated
superior generalization capabilities by providing zero-shot solutions for
problems not present in the training data, positioning them as a potential
solution for introducing robots to unstructured environments. To this end, this
study investigates potential opportunities and challenges of pretrained
foundation models from a multi-dimensional perspective. The study
systematically reviews applications of foundation models in the two fields of
robotics and unstructured environments, and then synthesizes them with
deliberative acting theory. The findings show that the linguistic capabilities
of LLMs have been utilized
more than other features for improving perception in human-robot interactions.
On the other hand, the findings show that LLMs have found more applications
in project management and safety in construction, and in natural hazard
detection in disaster management. Synthesizing these findings, we
located the current state of the art in this field on a five-level scale of
automation, placing it at conditional automation. This assessment was then
used to envision future scenarios, challenges, and solutions toward autonomous
safe unstructured environments. Our study can be seen as a benchmark to track
our progress toward that future.
comment: arXiv admin note: text overlap with arXiv:2312.07843,
arXiv:2402.05741 by other authors
♻ ☆ Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu
As large language models (LLMs) become increasingly prevalent across many
real-world applications, understanding and enhancing their robustness to
adversarial attacks is of paramount importance. Existing methods for
identifying adversarial prompts tend to focus on specific domains, lack
diversity, or require extensive human annotations. To address these
limitations, we present Rainbow Teaming, a novel black-box approach for
producing a diverse collection of adversarial prompts. Rainbow Teaming casts
adversarial prompt generation as a quality-diversity problem, and uses
open-ended search to generate prompts that are both effective and diverse.
Focusing on the safety domain, we use Rainbow Teaming to target various
state-of-the-art LLMs, including the Llama 2 and Llama 3 models. Our approach
reveals hundreds of effective adversarial prompts, with an attack success rate
exceeding 90% across all tested models. Furthermore, we demonstrate that
fine-tuning models with synthetic data generated by the Rainbow Teaming method
significantly enhances their safety without sacrificing general performance or
helpfulness. We additionally explore the versatility of Rainbow Teaming by
applying it to question answering and cybersecurity, showcasing its potential
to drive robust open-ended self-improvement in a wide range of applications.
♻ ☆ ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
Research on Large Language Models (LLMs) has recently witnessed an increasing
interest in extending models' context size to better capture dependencies
within long documents. While benchmarks have been proposed to assess long-range
abilities, existing efforts primarily considered generic tasks that are not
necessarily aligned with real-world applications. In contrast, our work
proposes a new benchmark for long-context LLMs focused on a practical meeting
assistant scenario. In this scenario, the long contexts consist of transcripts
obtained by automatic speech recognition, presenting unique challenges for LLMs
due to the inherent noisiness and oral nature of such data. Our benchmark,
named ELITR-Bench, augments the existing ELITR corpus' transcripts with 271
manually crafted questions and their ground-truth answers. Our experiments with
recent long-context LLMs on ELITR-Bench highlight a gap between open-source and
proprietary models, especially when questions are asked sequentially within a
conversation. We also provide a thorough analysis of our GPT-4-based evaluation
method, encompassing insights from a crowdsourcing study. Our findings suggest
that while GPT-4's evaluation scores are correlated with human judges', its
ability to differentiate among more than three score levels may be limited.
♻ ☆ Who Shares Fake News? Uncovering Insights from Social Media Users' Post Histories
We propose that social-media users' own post histories are an underused yet
valuable resource for studying fake-news sharing. By extracting textual cues
from their prior posts, and contrasting their prevalence against random
social-media users and others (e.g., those with similar socio-demographics,
political news-sharers, and fact-check sharers), researchers can identify cues
that distinguish fake-news sharers, predict those most likely to share fake
news, and identify promising constructs to build interventions. Our research
includes studies along these lines. In Study 1, we explore the distinctive
language patterns of fake-news sharers, highlighting elements such as their
higher use of anger and power-related words. In Study 2, we show that adding
textual cues into predictive models enhances their accuracy in predicting
fake-news sharers. In Study 3, we explore the contrasting role of trait and
situational anger, and show trait anger is associated with a greater propensity
to share both true and fake news. In Study 4, we introduce a way to
authenticate Twitter accounts in surveys, before using it to explore how
crafting an ad copy that resonates with users' sense of power encourages the
adoption of fact-checking tools. We hope to encourage the use of novel research
methods for marketers and misinformation researchers.
comment: 108 pages
♻ ☆ Beyond Memorization: The Challenge of Random Memory Access in Language Models ACL 2024
Recent developments in Language Models (LMs) have shown their effectiveness
in NLP tasks, particularly in knowledge-intensive tasks. However, the
mechanisms underlying knowledge storage and memory access within their
parameters remain elusive. In this paper, we investigate whether a generative
LM (e.g., GPT-2) is able to access its memory sequentially or randomly. Through
carefully-designed synthetic tasks, covering the scenarios of full recitation,
selective recitation and grounded question answering, we reveal that LMs manage
to sequentially access their memory while encountering challenges in randomly
accessing memorized content. We find that techniques including recitation and
permutation improve the random memory access capability of LMs. Furthermore, by
applying this intervention to realistic scenarios of open-domain question
answering, we validate that enhancing random access by recitation leads to
notable improvements in question answering. The code to reproduce our
experiments can be found at https://github.com/sail-sg/lm-random-memory-access.
comment: 9 pages, 4 figures; accepted by ACL 2024 (oral)
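The recitation intervention described above can be sketched as a two-step prompting wrapper; `lm` here is a stand-in text-completion function and the prompt wording is illustrative, not the paper's actual template:

```python
def recite_then_answer(question, lm):
    """Recitation intervention (sketch): before answering, prompt the model to
    recite the memorized content relevant to the question, then answer grounded
    in that recitation rather than via direct random memory access."""
    recitation = lm(f"Recite the memorized passage relevant to: {question}")
    return lm(f"Passage: {recitation}\nQuestion: {question}\nAnswer:")
```

The point of the design is to convert a random-access query into a sequential one: the model first reproduces the stored content in order, then reads the answer off its own recitation.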
♻ ☆ Fast and Effective Weight Update for Pruned Large Language Models
Pruning large language models (LLMs) is a challenging task due to their
enormous size. The primary difficulty is fine-tuning the model after pruning,
which is needed to recover the lost performance caused by dropping weights.
Recent approaches have either ignored fine-tuning entirely, focusing on
efficient pruning criteria, or attempted layer-wise weight updates, preserving
the behavior of each layer. However, even layer-wise weight updates can be
costly for LLMs, and previous works have resorted to various approximations.
In our paper, we propose a fast and effective weight update algorithm for
pruned layers based on the Alternating Direction Method of Multipliers (ADMM).
We further extend it with a simple gradual pruning mask selection and achieve
state-of-the-art pruning performance across a wide range of LLMs. Code is
available at https://github.com/fmfi-compbio/admm-pruning.
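A layer-wise ADMM weight update of the kind described can be sketched as follows; this is a minimal reconstruction of the generic scheme (reconstruct the dense layer's output subject to a fixed sparsity mask), not the repository's actual implementation, and the hyperparameters are illustrative:

```python
import numpy as np

def admm_weight_update(X, W0, mask, rho=1.0, iters=200):
    """Layer-wise weight update for a pruned linear layer via ADMM (sketch).

    Finds sparse weights whose output X @ W stays close to the dense output
    X @ W0 while respecting the pruning mask (0 = pruned weight).
    X: (n, d) calibration inputs; W0: (d, k) dense weights; mask: (d, k).
    """
    d = W0.shape[0]
    H = X.T @ X                        # Hessian of the reconstruction loss
    A = np.linalg.inv(H + rho * np.eye(d))
    b = H @ W0
    Z = W0 * mask                      # start from naively masked weights
    U = np.zeros_like(W0)
    for _ in range(iters):
        W = A @ (b + rho * (Z - U))    # quadratic step with penalty term
        Z = (W + U) * mask             # projection onto the sparsity pattern
        U = U + W - Z                  # dual variable update
    return Z                           # sparse weights satisfying the mask
```

Because the subproblem matrix `(H + rho * I)` is inverted once outside the loop, each iteration is a pair of matrix multiplications, which is what makes this style of update fast relative to full fine-tuning.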
♻ ☆ GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents
Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, Joshua Clymer, Arjun Yadav
Large language models have demonstrated remarkable few-shot performance on
many natural language understanding tasks. Despite several demonstrations of
using large language models in complex, strategic scenarios, there is no
comprehensive framework for evaluating agents' performance across various types
of reasoning found in games. To address this gap, we introduce GameBench, a
cross-domain benchmark for evaluating strategic reasoning abilities of LLM
agents. We focus on 9 different game environments, each of which covers at
least one key reasoning skill identified in strategy games, and we select games
for which strategy explanations are unlikely to form a significant portion of
models' pretraining corpora. Our evaluations use GPT-3 and GPT-4 in their base
form along with two scaffolding frameworks designed to enhance strategic
reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning
(RAP). Our results show that none of the tested models match human performance,
and at worst GPT-4 performs worse than random action. CoT and RAP both improve
scores, but not to levels comparable with human performance.
♻ ☆ Chain of Code: Reasoning with a Language Model-Augmented Code Emulator ICML 2024
Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, Brian Ichter
Code provides a general syntactic structure to build complex programs and
perform precise computations when paired with a code interpreter - we
hypothesize that language models (LMs) can leverage code-writing to improve
Chain of Thought reasoning not only for logic and arithmetic tasks, but also
for semantic ones (and in particular, those that are a mix of both). For
example, consider prompting an LM to write code that counts the number of times
it detects sarcasm in an essay: the LM may struggle to write an implementation
for "detect_sarcasm(string)" that can be executed by the interpreter (handling
the edge cases would be insurmountable). However, LMs may still produce a valid
solution if they not only write code, but also selectively "emulate" the
interpreter by generating the expected output of "detect_sarcasm(string)". In
this work, we propose Chain of Code (CoC), a simple yet surprisingly effective
extension that improves LM code-driven reasoning. The key idea is to encourage
LMs to format semantic sub-tasks in a program as flexible pseudocode whose
undefined behaviors the interpreter can explicitly catch and hand off to an LM
to simulate (as an "LMulator"). Experiments demonstrate that Chain of Code
outperforms Chain of Thought and other baselines across a variety of
benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over
Chain of Thought. In a nutshell, CoC broadens the scope of reasoning questions
that LMs can answer by "thinking in code".
comment: ICML 2024 Oral; Project webpage: https://chain-of-code.github.io
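The catch-and-hand-off idea can be sketched with a namespace whose unresolved names fall back to an LM stub; this is an illustrative reconstruction of the mechanism, not the paper's implementation, and `lm_simulate` is a hypothetical stand-in for querying a model:

```python
class LMulator(dict):
    """Namespace whose unresolved names are handed off to an LM stub.

    When the interpreter cannot resolve a name (e.g. `detect_sarcasm`),
    `__missing__` returns a callable that asks the LM to emulate it. Note
    that every unresolved name is intercepted, including builtins, so this
    toy runs pseudocode with an empty builtins table.
    """
    def __init__(self, lm_simulate):
        super().__init__()
        self._lm = lm_simulate

    def __missing__(self, name):
        return lambda *args: self._lm(name, args)


def run_chain_of_code(program, lm_simulate):
    """Run pseudocode: the interpreter executes what it can; the LM
    'emulates' the rest. Returns the final variable bindings."""
    env = LMulator(lm_simulate)
    exec(program, {"__builtins__": {}}, env)
    return dict(env)
```

This works because `exec` accepts any mapping as its locals, so name lookups on a `dict` subclass go through `__missing__` when the name is undefined, which is exactly the "catch undefined behavior, simulate with an LM" hand-off.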
♻ ☆ Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy
Human forecasting accuracy in practice relies on the 'wisdom of the crowd'
effect, in which predictions about future events are significantly improved by
aggregating across a crowd of individual forecasters. Past work on the
forecasting ability of large language models (LLMs) suggests that frontier
LLMs, as individual forecasters, underperform compared to the gold standard of
a human crowd forecasting tournament aggregate. In Study 1, we expand this
research by using an LLM ensemble approach consisting of a crowd of twelve
LLMs. We compare the aggregated LLM predictions on 31 binary questions to that
of a crowd of 925 human forecasters from a three-month forecasting tournament.
Our preregistered main analysis shows that the LLM crowd outperforms a simple
no-information benchmark and is not statistically different from the human
crowd. In exploratory analyses, we find that these two approaches are
equivalent with respect to medium-effect-size equivalence bounds. We also
observe an acquiescence effect, with mean model predictions being significantly
above 50%, despite an almost even split of positive and negative resolutions.
Moreover, in Study 2, we test whether LLM predictions (of GPT-4 and Claude 2)
can be improved by drawing on human cognitive output. We find that both models'
forecasting accuracy benefits from exposure to the median human prediction as
information, improving accuracy by between 17% and 28%, though this leads to
less accurate predictions than simply averaging human and machine forecasts.
Our results suggest that LLMs can achieve forecasting accuracy rivaling that of
human crowd forecasting tournaments via the simple, practically applicable
method of forecast aggregation. This replicates the 'wisdom of the crowd'
effect for LLMs, and opens up their use for a variety of applications
throughout society.
comment: 20 pages; 13 visualizations (nine figures, four tables)
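The aggregation step itself is simple; a minimal sketch, using the median as one common robust crowd aggregate (the study's exact aggregation rule may differ) and an even human-machine average for the Study 2 style blend:

```python
from statistics import median

def aggregate_forecasts(model_probs):
    """Aggregate an ensemble's probability forecasts for one binary question.

    model_probs maps model name -> P(YES). The median is a simple, robust
    'wisdom of the crowd' aggregate."""
    return median(model_probs.values())

def blend_with_human(model_prob, human_prob):
    """Average a machine forecast with the human crowd forecast (sketch of
    the averaging comparison mentioned in the abstract)."""
    return (model_prob + human_prob) / 2
```

An acquiescence effect like the one reported would show up here as the aggregate sitting systematically above 0.5 across a question set with balanced resolutions.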
♻ ☆ AdaptEval: Evaluating Large Language Models on Domain Adaptation for Text Summarization
Despite the advances in the abstractive summarization task using Large
Language Models (LLM), there is a lack of research that assesses their abilities
to easily adapt to different domains. We evaluate the domain adaptation
abilities of a wide range of LLMs on the summarization task across various
domains in both fine-tuning and in-context learning settings. We also present
AdaptEval, the first domain adaptation evaluation suite. AdaptEval includes a
domain benchmark and a set of metrics to facilitate the analysis of domain
adaptation. Our results demonstrate that LLMs exhibit comparable performance in
the in-context learning setting, regardless of their parameter scale.
♻ ☆ Mitigating Entity-Level Hallucination in Large Language Models
The emergence of Large Language Models (LLMs) has revolutionized how users
access information, shifting from traditional search engines to direct
question-and-answer interactions with LLMs. However, the widespread adoption of
LLMs has revealed a significant challenge known as hallucination, wherein LLMs
generate coherent yet factually inaccurate responses. This hallucination
phenomenon has led to users' distrust in information retrieval systems based on
LLMs. To tackle this challenge, this paper proposes Dynamic Retrieval
Augmentation based on hallucination Detection (DRAD) as a novel method to
detect and mitigate hallucinations in LLMs. DRAD improves upon traditional
retrieval augmentation by dynamically adapting the retrieval process based on
real-time hallucination detection. It features two main components: Real-time
Hallucination Detection (RHD) for identifying potential hallucinations without
external models, and Self-correction based on External Knowledge (SEK) for
correcting these errors using external knowledge. Experiment results show that
DRAD demonstrates superior performance in both detecting and mitigating
hallucinations in LLMs. All of our code and data are open-sourced at
https://github.com/oneal2000/EntityHallucination.
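The control flow of DRAD as described, retrieval triggered only when real-time detection flags the draft answer, can be sketched as follows; all four callables are stand-ins for the paper's RHD and SEK components:

```python
def drad_answer(question, generate, detect_hallucination, retrieve, correct):
    """DRAD control flow (sketch): retrieval augmentation is dynamic, i.e.
    invoked only when Real-time Hallucination Detection (RHD) flags the
    draft answer; Self-correction based on External Knowledge (SEK) then
    revises the answer using the retrieved evidence."""
    answer = generate(question)
    if detect_hallucination(answer):           # RHD: flag a risky draft
        evidence = retrieve(question, answer)  # fetch external knowledge
        answer = correct(question, answer, evidence)  # SEK: revise
    return answer
```

The contrast with traditional retrieval augmentation is the conditional: clean drafts skip retrieval entirely, saving a retrieval call per confident answer.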
♻ ☆ A Survey of Large Language Models in Medicine: Progress, Application, and Challenge
Hongjian Zhou, Fenglin Liu, Boyang Gu, Xinyu Zou, Jinfa Huang, Jinge Wu, Yiru Li, Sam S. Chen, Peilin Zhou, Junling Liu, Yining Hua, Chengfeng Mao, Chenyu You, Xian Wu, Yefeng Zheng, Lei Clifton, Zheng Li, Jiebo Luo, David A. Clifton
Large language models (LLMs), such as ChatGPT, have received substantial
attention due to their capabilities for understanding and generating human
language. While there has been a burgeoning trend in research focusing on the
employment of LLMs in supporting different medical tasks (e.g., enhancing
clinical diagnostics and providing medical education), a review of these
efforts, particularly their development, practical applications, and outcomes
in medicine, remains scarce. Therefore, this review aims to provide a detailed
overview of the development and deployment of LLMs in medicine, including the
challenges and opportunities they face. In terms of development, we provide a
detailed introduction to the principles of existing medical LLMs, including
their basic model structures, number of parameters, and sources and scales of
data used for model development. It serves as a guide for practitioners in
developing medical LLMs tailored to their specific needs. In terms of
deployment, we offer a comparison of the performance of different LLMs across
various medical tasks, and further compare them with state-of-the-art
lightweight models, aiming to provide an understanding of the advantages and
limitations of LLMs in medicine. Overall, in this review, we address the
following questions: 1) What are the practices for developing medical LLMs? 2)
How to measure the medical task performance of LLMs in a medical setting? 3)
How have medical LLMs been employed in real-world practice? 4) What challenges
arise from the use of medical LLMs? and 5) How to more effectively develop and
deploy medical LLMs? By answering these questions, this review aims to provide
insights into the opportunities for LLMs in medicine and serve as a practical
resource. We also maintain a regularly updated list of practical guides on
medical LLMs at https://github.com/AI-in-Health/MedLLMsPracticalGuide
comment: Preprint. Version 6. Update Figures 1-5; Tables 2-3; 31 pages
♻ ☆ Cross-Speaker Encoding Network for Multi-Talker Speech Recognition ICASSP2024
End-to-end multi-talker speech recognition has garnered great interest as an
effective approach to directly transcribe overlapped speech from multiple
speakers. Current methods typically adopt either 1) single-input
multiple-output (SIMO) models with a branched encoder, or 2) single-input
single-output (SISO) models based on attention-based encoder-decoder
architecture with serialized output training (SOT). In this work, we propose a
Cross-Speaker Encoding (CSE) network to address the limitations of SIMO models
by aggregating cross-speaker representations. Furthermore, the CSE model is
integrated with SOT to leverage both the advantages of SIMO and SISO while
mitigating their drawbacks. To the best of our knowledge, this work represents
an early effort to integrate SIMO and SISO for multi-talker speech recognition.
Experiments on the two-speaker LibrispeechMix dataset show that the CSE model
reduces word error rate (WER) by 8% over the SIMO baseline. The CSE-SOT model
reduces WER by 10% overall and by 16% on high-overlap speech compared to the
SOT model. Code is available at https://github.com/kjw11/CSEnet-ASR.
comment: Accepted by ICASSP2024
♻ ☆ TTSDS -- Text-to-Speech Distribution Score
Many recently published Text-to-Speech (TTS) systems produce audio close to
real speech. However, TTS evaluation needs to be revisited to make sense of the
results obtained with the new architectures, approaches and datasets. We
propose evaluating the quality of synthetic speech as a combination of multiple
factors such as prosody, speaker identity, and intelligibility. Our approach
assesses how well synthetic speech mirrors real speech by obtaining correlates
of each factor and measuring their distance from both real speech datasets and
noise datasets. We benchmark 35 TTS systems developed between 2008 and 2024 and
show that our score computed as an unweighted average of factors strongly
correlates with the human evaluations from each time period.
comment: Under review for SLT 2024
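The scoring idea, how much closer synthetic speech sits to real speech than to noise, per factor, then averaged, can be sketched as below; the distance between distribution means is a simple stand-in for whatever distributional distance the system actually uses:

```python
def factor_score(synthetic, real, noise):
    """Score one factor correlate (e.g. a prosody measure) for a TTS system.

    Each argument is a list of scalar correlate values. The score reflects
    how much closer the synthetic distribution is to real speech than to
    noise; 1.0 means indistinguishable from real by this factor."""
    mean = lambda xs: sum(xs) / len(xs)
    d_real = abs(mean(synthetic) - mean(real))
    d_noise = abs(mean(synthetic) - mean(noise))
    return d_noise / (d_real + d_noise)

def ttsds_score(factor_scores):
    """Overall score: unweighted average over factors, as in the abstract."""
    return sum(factor_scores) / len(factor_scores)
```

Normalizing each factor against both a real-speech anchor and a noise anchor is what lets systems from different eras (2008-2024) land on one comparable scale.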
♻ ☆ Adversarial Style Augmentation via Large Language Model for Robust Fake News Detection
The spread of fake news negatively impacts individuals and is regarded as a
significant social challenge that needs to be addressed. A number of
algorithmic and insightful features have been identified for detecting fake
news. However, with the recent LLMs and their advanced generation capabilities,
many of the detectable features (e.g., style-conversion attacks) can be
altered, making it more challenging to distinguish from real news. This study
proposes adversarial style augmentation, AdStyle, to train a fake news detector
that remains robust against various style-conversion attacks. Our model's key
mechanism is the careful use of LLMs to automatically generate a diverse yet
coherent range of style-conversion attack prompts. This improves the generation
of prompts that are particularly difficult for the detector to handle.
Experiments show that our augmentation strategy improves robustness and
detection performance when tested on fake news benchmark datasets.
comment: 8 pages
♻ ☆ Unipa-GPT: Large Language Models for university-oriented QA in Italian
This paper illustrates the architecture and training of Unipa-GPT, a chatbot
relying on a Large Language Model, developed for assisting students in choosing
a bachelor/master degree course at the University of Palermo. Unipa-GPT relies
on gpt-3.5-turbo and was presented in the context of the European Researchers'
Night (SHARPER night). In our experiments we adopted both the Retrieval
Augmented Generation (RAG) approach and fine-tuning to develop the system. The
whole architecture of Unipa-GPT is presented, both the RAG and the fine-tuned
systems are compared, and a brief discussion on their performance is reported.
Further comparisons with other Large Language Models and the experimental
results obtained during the SHARPER night are also illustrated.
♻ ☆ MAPLE: Multilingual Evaluation of Parameter Efficient Finetuning of Large Language Models ACL 2024
Parameter Efficient Finetuning (PEFT) has emerged as a viable solution for
improving the performance of Large Language Models (LLMs) without requiring
massive resources and compute. Prior work on multilingual evaluation has shown
that there is a large gap between the performance of LLMs on English and other
languages. Further, there is also a large gap between the performance of
smaller open-source models and larger LLMs. Finetuning can be an effective way
to bridge this gap and make language models more equitable. In this work, we
finetune the LLama-2-7B and Mistral-7B models on two synthetic multilingual
instruction tuning datasets to determine their effect on model performance on six
downstream tasks covering forty languages in all. Additionally, we experiment
with various parameters, such as rank for low-rank adaptation and values of
quantisation to determine their effects on downstream performance and find that
higher rank and higher quantisation values benefit low-resource languages. We
find that PEFT of smaller open-source models sometimes bridges the gap between
the performance of these models and that of the larger ones; however, English
performance can take a hit. We also find that finetuning sometimes improves
performance on low-resource languages, while degrading performance on
high-resource languages.
comment: 46 pages, 23 figures, 45 tables. Accepted in ACL 2024 findings
♻ ☆ General-Purpose Retrieval-Enhanced Medical Prediction Model Using Near-Infinite History
Machine learning (ML) has recently shown promising results in medical
predictions using electronic health records (EHRs). However, since ML models
typically have a limited capability in terms of input sizes, selecting specific
medical events from EHRs for use as input is necessary. This selection process,
often relying on expert opinion, can cause bottlenecks in development. We
propose Retrieval-Enhanced Medical prediction model (REMed) to address such
challenges. REMed can essentially evaluate unlimited medical events, select the
relevant ones, and make predictions. This allows for an unrestricted input
size, eliminating the need for manual event selection. We verified these
properties through experiments involving 27 clinical prediction tasks across
four independent cohorts, where REMed outperformed the baselines. Notably, we
found that the preferences of REMed align closely with those of medical
experts. We expect our approach to significantly expedite the development of
EHR prediction models by minimizing clinicians' need for manual involvement.
comment: The source codes corresponding to this paper are available at:
https://github.com/starmpcc/REMed
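The retrieve-then-predict pipeline described can be sketched as a score-and-keep-top-k step in front of the predictor; `relevance` and `predictor` are hypothetical stand-ins for the learned components, and `k` is an illustrative budget:

```python
def remed_predict(events, relevance, predictor, k=128):
    """REMed-style pipeline (sketch): score every event in an unrestricted
    EHR history for relevance, keep the top-k, and predict from those, so
    the model's input-size limit no longer forces manual event selection."""
    retained = sorted(events, key=relevance, reverse=True)[:k]
    return predictor(retained)
```

Because scoring is per event, the history length is unbounded: only the predictor sees a fixed-size input, which is the property that removes the expert-driven event-selection bottleneck.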
♻ ☆ Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model
Large Language Models (LLMs) have demonstrated exceptional proficiency in
mathematical reasoning tasks due to their extensive parameter counts and
training on vast datasets. Despite these capabilities, deploying LLMs is
hindered by their computational demands. Distilling LLM mathematical reasoning
into Smaller Language Models (SLMs) has emerged as a solution to this
challenge, although these smaller models often suffer from errors in
calculation and semantic understanding. Prior work has proposed
Program-of-Thought Distillation (PoTD) to avoid calculation error. To further
address semantic understanding errors, we propose Key-Point-Driven Mathematical
Reasoning Distillation (KPDD). KPDD enhances the reasoning performance of SLMs
by breaking down the problem-solving process into three stages: Core Question
Extraction, Problem-Solving Information Extraction, and Step-by-Step Solution.
This method is further divided into KPDD-CoT, which generates Chain-of-Thought
rationales, and KPDD-PoT, which creates Program-of-Thought rationales. The
experiment results show that KPDD-CoT significantly improves reasoning
abilities, while KPDD-PoT achieves state-of-the-art performance in mathematical
reasoning tasks. Our approach effectively mitigates misunderstanding errors,
advancing the deployment of efficient and capable SLMs.
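The three-stage decomposition can be sketched as a prompt template; the wording below is illustrative (the abstract does not give the actual templates), with `mode` switching between the CoT and PoT variants:

```python
def kpdd_prompt(problem, mode="CoT"):
    """Assemble a three-stage KPDD-style rationale prompt (sketch)."""
    final = ("Write a step-by-step solution." if mode == "CoT"
             else "Write a Python program that computes the answer.")
    return (
        f"Problem: {problem}\n"
        "Stage 1 - Core Question Extraction: state the core question.\n"
        "Stage 2 - Problem-Solving Information Extraction: list the key "
        "facts and quantities.\n"
        f"Stage 3 - Step-by-Step Solution: {final}\n"
    )
```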
♻ ☆ Meta-Task Prompting Elicits Embeddings from Large Language Models ACL 2024
We introduce a new unsupervised text embedding method, Meta-Task Prompting
with Explicit One-Word Limitation (MetaEOL), for generating high-quality
sentence embeddings from Large Language Models (LLMs) without the need for
model fine-tuning. Leveraging meta-task prompting, MetaEOL guides LLMs to
produce embeddings through a series of carefully designed prompts that address
multiple representational aspects. Our comprehensive experiments demonstrate
that embeddings averaged from various meta-tasks are versatile embeddings that
yield competitive performance on Semantic Textual Similarity (STS) benchmarks
and excel in downstream tasks, surpassing contrastive-trained models. Our
findings suggest a new scaling law, offering a versatile and resource-efficient
approach for embedding generation across diverse scenarios.
comment: ACL 2024
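The aggregation of meta-task embeddings reduces to a per-dimension average; a minimal sketch, where `embed` stands in for reading out an LLM's representation for a prompt and the prompt strings are illustrative, not the paper's:

```python
def meta_task_embedding(sentence, embed, meta_prompts):
    """MetaEOL-style aggregation (sketch): elicit one embedding per
    meta-task prompt and average them dimension-wise, so the result
    blends multiple representational aspects of the sentence."""
    vectors = [embed(prompt.format(sentence=sentence)) for prompt in meta_prompts]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```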
♻ ☆ EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation ACL 2022
Complete Multi-lingual Neural Machine Translation (C-MNMT) achieves superior
performance against the conventional MNMT by constructing multi-way aligned
corpus, i.e., aligning bilingual training examples from different language
pairs when either their source or target sides are identical. However, since
exactly identical sentences from different language pairs are scarce, the power
of the multi-way aligned corpus is limited by its scale. To handle this
problem, this paper proposes "Extract and Generate" (EAG), a two-step approach
to construct large-scale and high-quality multi-way aligned corpus from
bilingual data. Specifically, we first extract candidate aligned examples by
pairing the bilingual examples from different language pairs with highly
similar source or target sentences; and then generate the final aligned
examples from the candidates with a well-trained generation model. With this
two-step pipeline, EAG can construct a large-scale and multi-way aligned corpus
whose diversity is almost identical to the original bilingual corpus.
Experiments on two publicly available datasets, i.e., WMT-5 and OPUS-100, show
that the proposed method achieves significant improvements over strong
baselines, with +1.1 and +1.4 BLEU points improvements on the two datasets
respectively.
comment: Accepted as a long paper at ACL 2022
♻ ☆ From Black Boxes to Conversations: Incorporating XAI in a Conversational Agent
The goal of Explainable AI (XAI) is to design methods to provide insights
into the reasoning process of black-box models, such as deep neural networks,
in order to explain them to humans. Social science research states that such
explanations should be conversational, similar to human-to-human explanations.
In this work, we show how to incorporate XAI in a conversational agent, using a
standard design for the agent comprising natural language understanding and
generation components. We build upon an XAI question bank, which we extend by
quality-controlled paraphrases, to understand the user's information needs. We
further systematically survey the literature for suitable explanation methods
that provide the information to answer those questions, and present a
comprehensive list of suggestions. Our work is the first step towards truly
natural conversations about machine learning models with an explanation agent.
The comprehensive list of XAI questions and the corresponding explanation
methods may support other researchers in providing the necessary information to
address users' demands. To facilitate future work, we release our source code
and data.
comment: Accepted at The World Conference on eXplainable Artificial
Intelligence 2023 (XAI-2023)
♻ ☆ Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models ICML 2024
Watermarking generative models consists of planting a statistical signal
(watermark) in a model's output so that it can be later verified that the
output was generated by the given model. A strong watermarking scheme satisfies
the property that a computationally bounded attacker cannot erase the watermark
without causing significant quality degradation. In this paper, we study the
(im)possibility of strong watermarking schemes. We prove that, under
well-specified and natural assumptions, strong watermarking is impossible to
achieve. This holds even in the private detection algorithm setting, where the
watermark insertion and detection algorithms share a secret key, unknown to the
attacker. To prove this result, we introduce a generic efficient watermark
attack; the attacker is not required to know the private key of the scheme or
even which scheme is used. Our attack is based on two assumptions: (1) The
attacker has access to a "quality oracle" that can evaluate whether a candidate
output is a high-quality response to a prompt, and (2) The attacker has access
to a "perturbation oracle" which can modify an output with a nontrivial
probability of maintaining quality, and which induces an efficiently mixing
random walk on high-quality outputs. We argue that both assumptions can be
satisfied in practice by an attacker with weaker computational capabilities
than the watermarked model itself, to which the attacker has only black-box
access. Furthermore, our assumptions will likely only be easier to satisfy over
time as models grow in capabilities and modalities. We demonstrate the
feasibility of our attack by instantiating it to attack three existing
watermarking schemes for large language models: Kirchenbauer et al. (2023),
Kuditipudi et al. (2023), and Zhao et al. (2023). The same attack successfully
removes the watermarks planted by all three schemes, with only minor quality
degradation.
comment: ICML 2024. Website: https://hanlin-zhang.com/impossibility-watermarks
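The generic attack reduces to a quality-gated random walk over outputs; a minimal sketch under the paper's two oracle assumptions, with both oracles passed in as stand-in callables:

```python
import random

def erase_watermark(output, quality_ok, perturb, steps=1000, seed=0):
    """Generic watermark-removal attack as a random walk (sketch).

    quality_ok(text) -> bool  : the assumed 'quality oracle'
    perturb(text, rng) -> str : the assumed 'perturbation oracle'

    Repeatedly apply small perturbations, keeping only those that preserve
    quality; if the accepted steps mix well over high-quality outputs, the
    statistical watermark signal is washed out without the attacker knowing
    the scheme or its secret key."""
    rng = random.Random(seed)
    current = output
    for _ in range(steps):
        candidate = perturb(current, rng)
        if quality_ok(candidate):       # quality oracle accepts the step
            current = candidate
    return current
```

Note the attack never inspects the watermark itself: erasure follows purely from mixing over the set of high-quality responses, which is why the result holds for any scheme.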
♻ ☆ TokenSHAP: Interpreting Large Language Models with Monte Carlo Shapley Value Estimation
As large language models (LLMs) become increasingly prevalent in critical
applications, the need for interpretable AI has grown. We introduce TokenSHAP,
a novel method for interpreting LLMs by attributing importance to individual
tokens or substrings within input prompts. This approach adapts Shapley values
from cooperative game theory to natural language processing, offering a
rigorous framework for understanding how different parts of an input contribute
to a model's response. TokenSHAP leverages Monte Carlo sampling for
computational efficiency, providing interpretable, quantitative measures of
token importance. We demonstrate its efficacy across diverse prompts and LLM
architectures, showing consistent improvements over existing baselines in
alignment with human judgments, faithfulness to model behavior, and
consistency.
Our method's ability to capture nuanced interactions between tokens provides
valuable insights into LLM behavior, enhancing model transparency, improving
prompt engineering, and aiding in the development of more reliable AI systems.
TokenSHAP represents a significant step towards the necessary interpretability
for responsible AI deployment, contributing to the broader goal of creating
more transparent, accountable, and trustworthy AI systems.
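The Monte Carlo Shapley estimation at the core of the method can be sketched as follows, with a hypothetical `value_fn` standing in for the paper's response-similarity score between the full prompt and a token subset:

```python
import random

def token_shap(tokens, value_fn, n_samples=500, seed=0):
    """Monte Carlo estimate of per-token Shapley values.
    value_fn(subset_of_tokens) -> scalar quality/similarity of the model's
    response to that subset (an assumption; TokenSHAP's actual scoring
    compares responses to the full prompt)."""
    rng = random.Random(seed)
    phi = [0.0] * len(tokens)
    order = list(range(len(tokens)))
    for _ in range(n_samples):
        rng.shuffle(order)          # sample a random permutation
        included, prev = [], value_fn([])
        for i in order:
            included.append(tokens[i])
            cur = value_fn(list(included))
            # marginal contribution of token i in this permutation
            phi[i] += (cur - prev) / n_samples
            prev = cur
    return phi
```

Averaging marginal contributions over random permutations is what makes the estimate tractable for long prompts, where exact Shapley values would require exponentially many subsets.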
♻ ☆ MarkLLM: An Open-Source Toolkit for LLM Watermarking
Leyi Pan, Aiwei Liu, Zhiwei He, Zitian Gao, Xuandong Zhao, Yijian Lu, Binglin Zhou, Shuliang Liu, Hanlin Zhang, Xuming Hu, Lijie Wen, Irwin King
LLM watermarking, which embeds imperceptible yet algorithmically detectable
signals in model outputs to identify LLM-generated text, has become crucial in
mitigating the potential misuse of large language models. However, the
abundance of LLM watermarking algorithms, their intricate mechanisms, and the
complex evaluation procedures and perspectives pose challenges for researchers
and the community to easily experiment with, understand, and assess the latest
advancements. To address these issues, we introduce MarkLLM, an open-source
toolkit for LLM watermarking. MarkLLM offers a unified and extensible framework
for implementing LLM watermarking algorithms, while providing user-friendly
interfaces to ensure ease of access. Furthermore, it enhances understanding by
supporting automatic visualization of the underlying mechanisms of these
algorithms. For evaluation, MarkLLM offers a comprehensive suite of 12 tools
spanning three perspectives, along with two types of automated evaluation
pipelines. Through MarkLLM, we aim to support researchers while improving the
comprehension and involvement of the general public in LLM watermarking
technology, fostering consensus and driving further advancements in research
and application. Our code is available at https://github.com/THU-BPM/MarkLLM.
comment: 16 pages, 5 figures, 6 tables
♻ ☆ UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs ACL 2024
Chaoqun He, Renjie Luo, Shengding Hu, Yuanqian Zhao, Jie Zhou, Hanghao Wu, Jiajie Zhang, Xu Han, Zhiyuan Liu, Maosong Sun
Evaluation is pivotal for refining Large Language Models (LLMs), pinpointing
their capabilities, and guiding enhancements. The rapid development of LLMs
calls for a lightweight and easy-to-use framework for swift evaluation
deployment. However, considering various implementation details, developing a
comprehensive evaluation platform is never easy. Existing platforms are often
complex and poorly modularized, hindering seamless incorporation into research
workflows. This paper introduces UltraEval, a user-friendly evaluation
framework characterized by its lightweight nature, comprehensiveness,
modularity, and efficiency. We identify and reimplement three core components
of model evaluation (models, data, and metrics). The resulting composability
allows for the free combination of different models, tasks, prompts,
benchmarks, and metrics within a unified evaluation workflow. Additionally,
UltraEval supports diverse models owing to a unified HTTP service and provides
sufficient inference acceleration. UltraEval is now publicly available to
researchers.
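The models/data/metrics decomposition can be illustrated with a minimal sketch (function names are illustrative, not UltraEval's API): any model callable, any dataset of (input, reference) pairs, and any metric compose freely into one evaluation workflow.

```python
def evaluate(model, dataset, metric):
    """Compose a model, a dataset of (input, reference) pairs, and a
    metric into one evaluation run -- the free-combination idea."""
    predictions = [model(x) for x, _ in dataset]
    references = [y for _, y in dataset]
    return metric(predictions, references)

def exact_match(preds, refs):
    """One interchangeable metric among many."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)
```

Swapping any of the three components (e.g. a different metric or an HTTP-backed model callable) leaves the other two untouched, which is the modularity the framework aims for.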
comment: Accepted by ACL 2024 System Demonstration Track, update
♻ ☆ Video Understanding with Large Language Models: A Survey
Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu
With the burgeoning growth of online video platforms and the escalating
volume of video content, the demand for proficient video understanding tools
has intensified markedly. Given the remarkable capabilities of large language
models (LLMs) in language and multimodal tasks, this survey provides a detailed
overview of recent advancements in video understanding that harness the power
of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly
advanced, particularly their ability for open-ended multi-granularity (general,
temporal, and spatiotemporal) reasoning combined with commonsense knowledge,
suggesting a promising path for future video understanding. We examine the
unique characteristics and capabilities of Vid-LLMs, categorizing the
approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM,
and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based
on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as
Text Decoder, LLM as Regressor, and LLM as Hidden Layer. In addition, this
survey presents a comprehensive study of the tasks, datasets, benchmarks, and
evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive
applications of Vid-LLMs across various domains, highlighting their remarkable
scalability and versatility in real-world video understanding challenges.
Finally, it summarizes the limitations of existing Vid-LLMs and outlines
directions for future research. For more information, readers are recommended
to visit the repository at
https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.
♻ ☆ A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
Current Large Language Models (LLMs) are not only limited to some maximum
context length, but also are not able to robustly consume long inputs. To
address these limitations, we propose ReadAgent, an LLM agent system that
increases effective context length up to 20x in our experiments. Inspired by
how humans interactively read long documents, we implement ReadAgent as a
simple prompting system that uses the advanced language capabilities of LLMs to
(1) decide what content to store together in a memory episode, (2) compress
those memory episodes into short episodic memories called gist memories, and
(3) take actions to look up passages in the original text if ReadAgent needs to
remind itself of relevant details to complete a task. We evaluate ReadAgent
against baselines using retrieval methods, using the original long contexts,
and using the gist memories. These evaluations are performed on three
long-document reading comprehension tasks: QuALITY, NarrativeQA, and QMSum.
ReadAgent outperforms the baselines on all three tasks while extending the
effective context window by 3.5-20x.
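The three-step loop above can be sketched as a simple prompting system, assuming only a hypothetical `llm(prompt)` callable (one page per episode for brevity; ReadAgent itself groups pages into episodes):

```python
def read_agent(llm, pages, question, max_lookups=3):
    """ReadAgent-style sketch: (1) store content in episodes, (2) compress
    each into a short gist memory, (3) look up original passages on demand
    while answering. `llm(prompt)` is an assumed text-in/text-out callable."""
    # steps (1)+(2): gist each episode
    gists = [llm(f"Summarize into a short gist:\n{p}") for p in pages]
    context = "\n".join(f"[page {i}] {g}" for i, g in enumerate(gists))
    # step (3): let the model request original pages before answering
    for _ in range(max_lookups):
        reply = llm(f"Gists:\n{context}\nQuestion: {question}\n"
                    "Answer, or reply LOOKUP <page number> to see a page.")
        if not reply.startswith("LOOKUP"):
            return reply
        i = int(reply.split()[1])
        context += f"\nFull text of page {i}: {pages[i]}"
    return llm(f"Gists:\n{context}\nQuestion: {question}\nAnswer:")
```

Because only the short gists stay in context and full pages are fetched lazily, the effective context grows well beyond the model's native window.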
comment: Website: https://read-agent.github.io
♻ ☆ FineSurE: Fine-grained Summarization Evaluation using LLMs ACL 2024
Automated evaluation is crucial for streamlining text summarization
benchmarking and model development, given the costly and time-consuming nature
of human evaluation. Traditional methods like ROUGE do not correlate well with
human judgment, while recently proposed LLM-based metrics provide only
summary-level assessment using Likert-scale scores. This limits deeper model
analysis, e.g., we can only assign one hallucination score at the summary
level, while at the sentence level, we can count sentences containing
hallucinations. To remedy those limitations, we propose FineSurE, a
fine-grained evaluator specifically tailored for the summarization task using
large language models (LLMs). It also employs completeness and conciseness
criteria, in addition to faithfulness, enabling multi-dimensional assessment.
We compare various open-source and proprietary LLMs as backbones for FineSurE.
In addition, we conduct extensive benchmarking of FineSurE against SOTA methods
including NLI-, QA-, and LLM-based methods, showing improved performance
especially on the completeness and conciseness dimensions. The code is
available at https://github.com/DISL-Lab/FineSurE-ACL24.
comment: Accepted at ACL 2024 (main, long)
♻ ☆ Query-OPT: Optimizing Inference of Large Language Models via Multi-Query Instructions in Meeting Summarization
This work focuses on the task of query-based meeting summarization in which
the summary of a context (meeting transcript) is generated in response to a
specific query. When using Large Language Models (LLMs) for this task, usually
a new call to the LLM inference endpoint/API is triggered for each new query,
even if the context stays the same. However, repeated calls to the LLM
inference endpoints would significantly increase the costs of using them in
production, making LLMs impractical for many real-world use cases. To address
this problem, in this paper, we investigate whether combining the queries for
the same input context in a single prompt to minimize repeated calls can be
successfully used in meeting summarization. In this regard, we conduct
extensive experiments by comparing the performance of various popular LLMs:
GPT-4, Gemini, Claude-3, LLaMA-2, Mistral, Phi-3, and Qwen-2 in single-query
and multi-query settings. We observe that 100% reliability in generating the
response in the expected format is usually limited to certain closed-source
LLMs, with most open-source LLMs lagging behind (except for a few 7B-parameter
LLMs like Mistral and Phi-3). We conclude that multi-query prompting could be
useful to significantly optimize the inference costs in meeting summarization.
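The multi-query setting, and the output-format reliability it depends on, can be sketched as follows (the JSON schema is illustrative, not necessarily the paper's prompt format):

```python
import json

def multi_query_prompt(transcript, queries):
    """Pack all queries over one context into a single prompt, so one LLM
    call replaces one call per query."""
    numbered = "\n".join(f"Q{i + 1}: {q}" for i, q in enumerate(queries))
    return (f"Meeting transcript:\n{transcript}\n\n"
            f"Answer every query below. Respond with a JSON object mapping "
            f"Q1..Q{len(queries)} to the answers.\n{numbered}")

def parse_multi_query_response(raw, n_queries):
    """Validate the expected format; returns None when the model failed to
    follow the schema -- the reliability issue the paper measures."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    expected = {f"Q{i + 1}" for i in range(n_queries)}
    return obj if set(obj) == expected else None
```

A model that always yields a parseable object here corresponds to the 100% format reliability observed mainly for certain closed-source LLMs.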
♻ ☆ GSQA: An End-to-End Model for Generative Spoken Question Answering
In recent advancements in spoken question answering (QA), end-to-end models
have made significant strides. However, previous research has primarily focused
on extractive span selection. While this extractive-based approach is effective
when answers are present directly within the input, it falls short in
addressing abstractive questions, where answers are not directly extracted but
inferred from the given information. To bridge this gap, we introduce the first
end-to-end Generative Spoken Question Answering (GSQA) model that empowers the
system to engage in abstractive reasoning. The challenge in training our GSQA
model lies in the absence of a spoken abstractive QA dataset. We propose using
text models for initialization and leveraging the extractive QA dataset to
transfer knowledge from the text generative model to the spoken generative
model. Experimental results indicate that our model surpasses the previous
extractive model by 3% on extractive QA datasets. Notably, the GSQA model is
fine-tuned only on the spoken extractive QA dataset; despite never having seen
any spoken abstractive QA data, it still closely matches the performance of the
cascade model. In conclusion, our GSQA model shows the
potential to generalize to a broad spectrum of questions, thus further
expanding the spoken question answering capabilities of abstractive QA. Our
code is available at https://voidful.github.io/GSQA
comment: 5 pages, 2 figures, Interspeech 2024
♻ ☆ Sketch-Guided Constrained Decoding for Boosting Blackbox Large Language Models without Logit Access ACL 2024
Constrained decoding, a technique for enforcing constraints on language model
outputs, offers a way to control text generation without retraining or
architectural modifications. Its application is, however, typically restricted
to models that give users access to next-token distributions (usually via
softmax logits), which poses a limitation with blackbox large language models
(LLMs). This paper introduces sketch-guided constrained decoding (SGCD), a
novel approach to constrained decoding for blackbox LLMs, which operates
without access to the logits of the blackbox LLM. SGCD utilizes a locally
hosted auxiliary model to refine the output of an unconstrained blackbox LLM,
effectively treating this initial output as a "sketch" for further elaboration.
This approach is complementary to traditional logit-based techniques and
enables the application of constrained decoding in settings where full model
transparency is unavailable. We demonstrate the efficacy of SGCD through
experiments in closed information extraction and constituency parsing, showing
how it enhances the utility and flexibility of blackbox LLMs for complex NLP
tasks.
comment: Accepted to ACL 2024 Oral
♻ ☆ $\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks
This paper presents $\forall$uto$\exists$val, a new approach for scaling LLM
assessment in translating formal syntax -- such as first-order logic, regular
expressions, etc. -- to natural language (interpretation) or vice versa
(compilation), thereby facilitating their use in applications such as
generating/explaining logic and control flow for programs, etc. Existing
approaches for LLM assessment in these areas require labor-intensive
ground-truth creation, the availability of which undermines the separation of
training and test sets. Furthermore, such datasets typically include relatively
few hand-coded test cases over which LLM accuracy is determined, thus making
them inadequate for determining the safety or correctness of their generated
outputs. We introduce a new approach that utilizes context-free grammars (CFGs)
to generate out-of-distribution datasets on the fly and perform closed-loop
testing of LLM capabilities using formal verifiers to guarantee the correctness
of LLM outputs without any human intervention. We release our dataset and
benchmark as open-source code at
\url{https://github.com/AAIR-lab/auto-llm-assessment}. We also conduct an
assessment of several SOTA closed and open-source LLMs to showcase the
feasibility and scalability of this paradigm. Our experiments reveal that SOTA
LLMs are unable to solve the formal translation task adequately.
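The closed loop described above, generating instances from a grammar and checking the model's output with a formal verifier, can be illustrated with a toy propositional-logic version (the grammar and the truth-table verifier are illustrative stand-ins, not the paper's CFGs or verifiers):

```python
import itertools
import random

def gen_formula(rng, depth=2):
    """Sample a propositional formula from a tiny CFG over atoms p, q."""
    if depth == 0:
        return rng.choice(["p", "q"])
    op = rng.choice(["and", "or", "not", "atom"])
    if op == "atom":
        return rng.choice(["p", "q"])
    if op == "not":
        return f"(not {gen_formula(rng, depth - 1)})"
    return f"({gen_formula(rng, depth - 1)} {op} {gen_formula(rng, depth - 1)})"

def equivalent(f, g):
    """Formal verifier: truth-table equivalence over the atoms p, q."""
    for p, q in itertools.product([False, True], repeat=2):
        if eval(f, {"p": p, "q": q}) != eval(g, {"p": p, "q": q}):
            return False
    return True

def auto_eval(llm_roundtrip, n=20, seed=0):
    """Closed-loop assessment: generate fresh formulas on the fly, run the
    model's interpretation+compilation round trip (an assumed callable),
    and score correctness with the verifier -- no human labels needed."""
    rng = random.Random(seed)
    formulas = [gen_formula(rng) for _ in range(n)]
    return sum(equivalent(f, llm_roundtrip(f)) for f in formulas) / n
```

Because the grammar generates test cases on the fly and the verifier decides correctness mechanically, there is no hand-labeled test set to leak into training data.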
♻ ☆ MLRegTest: A Benchmark for the Machine Learning of Regular Languages
Sam van der Poel, Dakotah Lambert, Kalina Kostyszyn, Tiantian Gao, Rahul Verma, Derek Andersen, Joanne Chau, Emily Peterson, Cody St. Clair, Paul Fodor, Chihiro Shibata, Jeffrey Heinz
Synthetic datasets constructed from formal languages allow fine-grained
examination of the learning and generalization capabilities of machine learning
systems for sequence classification. This article presents a new benchmark for
machine learning systems on sequence classification called MLRegTest, which
contains training, development, and test sets from 1,800 regular languages.
Different kinds of formal languages represent different kinds of long-distance
dependencies, and correctly identifying long-distance dependencies in sequences
is a known challenge for ML systems to generalize successfully. MLRegTest
organizes its languages according to their logical complexity (monadic second
order, first order, propositional, or monomial expressions) and the kind of
logical literals (string, tier-string, subsequence, or combinations thereof).
The logical complexity and choice of literal provides a systematic way to
understand different kinds of long-distance dependencies in regular languages,
and therefore to understand the capacities of different ML systems to learn
such long-distance dependencies. Finally, the performance of different neural
networks (simple RNN, LSTM, GRU, transformer) on MLRegTest is examined. The
main conclusion is that performance depends significantly on the kind of test
set, the class of language, and the neural network architecture.
comment: 43 pages, MLRegTest benchmark available at
https://doi.org/10.5061/dryad.dncjsxm4h, associated code at
https://github.com/heinz-jeffrey/subregular-learning