Computation and Language
☆ How Far Are We From AGI
The evolution of artificial intelligence (AI) has profoundly impacted human
society, driving significant advancements in multiple sectors. Yet, the
escalating demands on AI have highlighted the limitations of AI's current
offerings, catalyzing a movement towards Artificial General Intelligence (AGI).
AGI, distinguished by its ability to execute diverse real-world tasks with
efficiency and effectiveness comparable to human intelligence, represents a
paramount milestone in AI evolution. While existing works have summarized
specific recent advancements of AI, they lack a comprehensive discussion of
AGI's definitions, goals, and developmental trajectories. Different from
existing survey papers, this paper delves into the pivotal questions of our
proximity to AGI and the strategies necessary for its realization through
extensive surveys, discussions, and original perspectives. We start by
articulating the requisite capability frameworks for AGI, integrating the
internal, interface, and system dimensions. As the realization of AGI requires
more advanced capabilities and adherence to stringent constraints, we further
discuss necessary AGI alignment technologies to harmonize these factors.
Notably, we emphasize the importance of approaching AGI responsibly by first
defining the key levels of AGI progression, then presenting an evaluation
framework that situates the status quo, and finally giving our roadmap for
reaching the pinnacle of AGI. Moreover, to give tangible insights into the
ubiquitous impact of the integration of AI, we outline existing challenges and
potential pathways toward AGI in multiple domains. In sum, serving as a
pioneering exploration into the current state and future trajectory of AGI,
this paper aims to foster a collective comprehension and catalyze broader
public discussions among researchers and practitioners on AGI.
★ Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine
Large vision-language models (VLMs) fine-tuned on specialized visual
instruction-following data have exhibited impressive language reasoning
capabilities across various scenarios. However, this fine-tuning paradigm may
not be able to efficiently learn optimal decision-making agents in multi-step
goal-directed tasks from interactive environments. To address this challenge,
we propose an algorithmic framework that fine-tunes VLMs with reinforcement
learning (RL). Specifically, our framework provides a task description and then
prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM
to efficiently explore intermediate reasoning steps that lead to the final
text-based action. Next, the open-ended text output is parsed into an
executable action to interact with the environment to obtain goal-directed task
rewards. Finally, our framework uses these task rewards to fine-tune the entire
VLM with RL. Empirically, we demonstrate that our proposed framework enhances
the decision-making capabilities of VLM agents across various tasks, enabling
7B models to outperform commercial models such as GPT-4V or Gemini.
Furthermore, we find that CoT reasoning is a crucial component for performance
improvement, as removing the CoT reasoning results in a significant decrease in
the overall performance of our method.
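The following is a minimal sketch of the loop the abstract describes: prompt a
VLM for CoT reasoning, parse the open-ended text into an executable action, and
collect goal-directed task rewards for a policy-gradient update. `ToyVLM` and
`ToyEnv` are hypothetical stand-ins, not the authors' interfaces.

```python
import re

class ToyVLM:
    """Hypothetical stand-in for a fine-tuned VLM; a real one would
    condition on the image and emit CoT ending in 'Action: <a>'."""
    def generate(self, image, prompt):
        return "The goal is on the left, so I should go left. Action: left"

class ToyEnv:
    """Hypothetical stand-in for an interactive environment."""
    def reset(self):
        return "initial-observation-image"
    def step(self, action):
        reward = 1.0 if action == "left" else 0.0
        return "next-observation-image", reward, True  # obs, reward, done

def parse_action(text):
    """Parse the open-ended text output into an executable action."""
    m = re.search(r"Action:\s*(\w+)", text)
    return m.group(1) if m else "noop"

def collect_episode(vlm, env, task="reach the goal"):
    prompt = f"Task: {task}. Think step by step, then end with 'Action: <a>'."
    obs, done, trajectory = env.reset(), False, []
    while not done:
        text = vlm.generate(obs, prompt)      # CoT reasoning + text action
        action = parse_action(text)           # open-ended text -> action
        obs, reward, done = env.step(action)  # goal-directed task reward
        trajectory.append((text, reward))
    return trajectory

print(collect_episode(ToyVLM(), ToyEnv()))
```

The collected (generation, reward) pairs would then weight the token
log-likelihoods in a policy-gradient objective such as PPO, fine-tuning the
entire VLM.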
☆ Timeline-based Sentence Decomposition with In-Context Learning for Temporal Fact Extraction ACL 2024
Fact extraction is pivotal for constructing knowledge graphs. Recently, the
increasing demand for temporal facts in downstream tasks has led to the
emergence of the task of temporal fact extraction. In this paper, we
specifically address the extraction of temporal facts from natural language
text. Previous studies fail to handle the challenge of establishing
time-to-fact correspondences in complex sentences. To overcome this hurdle, we
propose a timeline-based sentence decomposition strategy using large language
models (LLMs) with in-context learning, ensuring a fine-grained understanding
of the timeline associated with various facts. In addition, we evaluate the
performance of LLMs on direct temporal fact extraction and find the results
unsatisfactory. Motivated by this, we introduce TSDRE, a method that incorporates the
decomposition capabilities of LLMs into the traditional fine-tuning of smaller
pre-trained language models (PLMs). To support the evaluation, we construct
ComplexTRED, a complex temporal fact extraction dataset. Our experiments show
that TSDRE achieves state-of-the-art results on both HyperRED-Temporal and
ComplexTRED datasets.
comment: Accepted to ACL 2024 main conference
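As a hedged illustration of the timeline-based decomposition strategy, the
exemplar and prompt format below are our own construction, not taken from the
paper:

```python
FEW_SHOT = (
    "Sentence: Alice was CEO of Acme from 2001 to 2005 and then chaired "
    "its board until 2010.\n"
    "Decomposition:\n"
    "- (2001, 2005): Alice was CEO of Acme.\n"
    "- (2005, 2010): Alice chaired the board of Acme.\n"
)

def build_prompt(sentence: str) -> str:
    """Assemble an in-context prompt asking an LLM to split a complex
    sentence into one simple clause per time interval."""
    return (
        "Decompose the sentence into simple clauses, one per time interval.\n\n"
        + FEW_SHOT
        + f"\nSentence: {sentence}\nDecomposition:"
    )

print(build_prompt("Bob taught at MIT from 1990 to 1995 and at CMU afterwards."))
```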
☆ Revisiting OPRO: The Limitations of Small-Scale LLMs as Optimizers
Numerous recent works aim to enhance the efficacy of Large Language Models
(LLMs) through strategic prompting. In particular, the Optimization by
PROmpting (OPRO) approach provides state-of-the-art performance by leveraging
LLMs as optimizers where the optimization task is to find instructions that
maximize the task accuracy. In this paper, we revisit OPRO for automated
prompting with relatively small-scale LLMs, such as LLaMa-2 family and Mistral
7B. Our investigation reveals that OPRO shows limited effectiveness in
small-scale LLMs, with limited inference capabilities constraining optimization
ability. We suggest that future automatic prompt engineering consider both
model capabilities and computational costs. Additionally, for small-scale LLMs,
we recommend direct instructions that clearly outline objectives and
methodologies as robust prompt baselines, ensuring efficient and effective
prompt engineering in ongoing research.
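For context, a minimal sketch of one OPRO iteration: the optimizer LLM sees
prior (instruction, score) pairs, sorted so the best appear last, and proposes
a new instruction. `llm` and `evaluate` here are hypothetical callables, not
the paper's implementation.

```python
def opro_step(llm, evaluate, history, top_k=8):
    """One OPRO iteration over a history of (instruction, score) pairs."""
    shown = sorted(history, key=lambda pair: pair[1])[-top_k:]  # ascending
    meta_prompt = "Below are instructions with their task accuracies:\n"
    meta_prompt += "\n".join(f"text: {inst}\nscore: {score:.1f}"
                             for inst, score in shown)
    meta_prompt += "\nWrite a new instruction that achieves a higher score."
    candidate = llm(meta_prompt)              # optimizer LLM proposes
    history.append((candidate, evaluate(candidate)))  # scorer evaluates
    return history
```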
☆ A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision
Charles Raude, K R Prajwal, Liliane Momeni, Hannah Bull, Samuel Albanie, Andrew Zisserman, Gül Varol
In this work, our goals are twofold: large-vocabulary continuous sign
language recognition (CSLR) and sign language retrieval. To this end, we
introduce a multi-task Transformer model, CSLR2, that ingests a signing
sequence and outputs embeddings in a joint space shared by signed language
and spoken language text. To enable CSLR evaluation in the large-vocabulary
setting, we introduce new dataset annotations that have been manually
collected. These provide continuous sign-level annotations for six hours of
test videos, and will be made publicly available. We demonstrate that by a
careful choice of loss functions, training the model for both the CSLR and
retrieval tasks is mutually beneficial in terms of performance -- retrieval
improves CSLR performance by providing context, while CSLR improves retrieval
with more fine-grained supervision. We further show the benefits of leveraging
weak and noisy supervision from large-vocabulary datasets such as BOBSL, namely
sign-level pseudo-labels, and English subtitles. Our model significantly
outperforms the previous state of the art on both tasks.
☆ Keep It Private: Unsupervised Privatization of Online Text
Authorship obfuscation techniques hold the promise of helping people protect
their privacy in online communications by automatically rewriting text to hide
the identity of the original author. However, obfuscation has been evaluated in
narrow settings in the NLP literature and has primarily been addressed with
superficial edit operations that can lead to unnatural outputs. In this work,
we introduce an automatic text privatization framework that fine-tunes a large
language model via reinforcement learning to produce rewrites that balance
soundness, sense, and privacy. We evaluate it extensively on a large-scale test
set of English Reddit posts by 68k authors, composed of short- to medium-length
texts. We study how performance changes across evaluation conditions,
including authorial profile length and authorship detection strategy. Our
method maintains high text quality according to both automated metrics and
human evaluation, and successfully evades several automated authorship attacks.
comment: 17 pages, 6 figures
☆ A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks CCL2023
Recent efforts have evaluated large language models (LLMs) in areas such as
commonsense reasoning, mathematical reasoning, and code generation. However, to
the best of our knowledge, no work has specifically investigated the
performance of LLMs in natural language generation (NLG) tasks, a pivotal
criterion for determining model excellence. Thus, this paper conducts a
comprehensive evaluation of well-known and high-performing LLMs, namely
ChatGPT, ChatGLM, T5-based models, LLaMA-based models, and Pythia-based models,
in the context of NLG tasks. We select English and Chinese datasets
encompassing Dialogue Generation and Text Summarization. Moreover, we propose a
common evaluation setting that incorporates input templates and post-processing
strategies. Our study reports automatic results, accompanied by a detailed
analysis.
comment: CCL2023
☆ Words as Trigger Points in Social Media Discussions
Dimosthenis Antypas, Christian Arnold, Jose Camacho-Collados, Nedjma Ousidhoum, Carla Perez Almendros
Trigger points are a concept introduced by Mau, Lux, and Westheuser (2023) to
study qualitative focus group interviews and understand polarisation in
Germany. When people communicate, trigger points represent moments when
individuals feel that their understanding of what is fair, normal, or
appropriate in society is questioned. In the original studies, individuals
react affectively to such triggers and show strong and negative emotional
responses. In this paper, we introduce the first systematic study of the
large-scale effect of individual words as trigger points by analysing a large
number of social media posts. We examine online deliberations on Reddit between
2020 and 2022 and collect >100 million posts from subreddits related to a set
of words identified as trigger points in UK politics. We find that such trigger
words affect user engagement and have noticeable consequences on animosity in
online discussions. We provide empirical evidence that trigger words cause
animosity and show how they create incentives for hate speech, adversarial
debates, and disagreements. Our work is the first to introduce trigger points
to computational studies of online communication. Our findings are relevant to
researchers interested in online harms and who examine how citizens debate
politics and society in light of affective polarisation.
☆ CPsyExam: A Chinese Benchmark for Evaluating Psychology using Examinations
Jiahao Zhao, Jingwei Zhu, Minghuan Tan, Min Yang, Di Yang, Chenhao Zhang, Guancheng Ye, Chengming Li, Xiping Hu
In this paper, we introduce a novel psychological benchmark, CPsyExam,
constructed from questions sourced from Chinese language examinations. CPsyExam
is designed to prioritize psychological knowledge and case analysis separately,
recognizing the significance of applying psychological knowledge to real-world
scenarios. From the pool of 22k questions, we utilize 4k to create the
benchmark that offers balanced coverage of subjects and incorporates a diverse
range of case analysis techniques. Furthermore, we evaluate a range of existing
large language models (LLMs), spanning from open-source to API-based models.
Our experiments and analysis demonstrate that CPsyExam serves as an effective
benchmark for enhancing the understanding of psychology within LLMs and enables
the comparison of LLMs across various granularities.
☆ Building a Luganda Text-to-Speech Model From Crowdsourced Data ICLR 2024
Text-to-speech (TTS) development for African languages such as Luganda is
still limited, primarily due to the scarcity of high-quality, single-speaker
recordings essential for training TTS models. Prior work has focused on
utilizing the Luganda Common Voice recordings of multiple speakers aged between
20 and 49. Although the generated speech is intelligible, it is still of lower
quality than the model trained on studio-grade recordings. This is due to the
insufficient data preprocessing methods applied to improve the quality of the
Common Voice recordings. Furthermore, speech convergence is more difficult to
achieve due to varying intonations, as well as background noise. In this paper,
we show that the quality of Luganda TTS from Common Voice can be improved by
training on multiple speakers of close intonation in addition to further
preprocessing of the training data. Specifically, we selected six female
speakers with close intonation, determined by subjectively listening to and
comparing their voice recordings. In addition to trimming out silent portions
from the beginning and end of the recordings, we applied a pre-trained speech
enhancement model to reduce background noise and enhance audio quality. We also
utilized a pre-trained, non-intrusive, self-supervised Mean Opinion Score (MOS)
estimation model to filter recordings with an estimated MOS over 3.5,
indicating high perceived quality. Subjective MOS evaluations from nine native
Luganda speakers demonstrate that our TTS model achieves a significantly better
MOS of 3.55 compared to the reported 2.5 MOS of the existing model. Moreover,
for a fair comparison, our model trained on six speakers outperforms models
trained on a single speaker (3.13 MOS) or two speakers (3.22 MOS). This
showcases the effectiveness of compensating for the lack of data from one
speaker with data from multiple speakers of close intonation to improve TTS
quality.
comment: Presented at the AfricaNLP workshop at ICLR 2024
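A rough sketch of the described preprocessing pipeline; `enhance` and
`estimate_mos` are hypothetical stubs standing in for the pretrained speech
enhancement and MOS estimation models:

```python
import librosa

def enhance(audio, sr):
    """Stand-in for a pretrained speech-enhancement model."""
    return audio

def estimate_mos(audio, sr):
    """Stand-in for a non-intrusive, self-supervised MOS estimator."""
    return 3.6

def preprocess_clip(path, mos_threshold=3.5):
    """Trim silence, enhance, and keep only clips above the MOS bar."""
    audio, sr = librosa.load(path, sr=None)
    audio, _ = librosa.effects.trim(audio, top_db=30)  # drop edge silence
    audio = enhance(audio, sr)
    return audio if estimate_mos(audio, sr) > mos_threshold else None
```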
☆ Hierarchical Attention Graph for Scientific Document Summarization in Global and Local Level NAACL 2024
Scientific document summarization has been a challenging task due to the long
structure of the input text. The long input hinders the simultaneous effective
modeling of both global high-order relations between sentences and local
intra-sentence relations, which is the most critical step in extractive
summarization. However, existing methods mostly focus on one type of relation,
neglecting the simultaneous effective modeling of both relations, which can
lead to insufficient learning of semantic representations. In this paper, we
propose HAESum, a novel approach utilizing graph neural networks to locally and
globally model documents based on their hierarchical discourse structure.
First, intra-sentence relations are learned using a local heterogeneous graph.
Subsequently, a novel hypergraph self-attention layer is introduced to further
enhance the characterization of high-order inter-sentence relations. We
validate our approach on two benchmark datasets, and the experimental results
demonstrate the effectiveness of HAESum and the importance of considering
hierarchical structures in modeling long scientific documents. Our code will be
available at https://github.com/MoLICHENXI/HAESum
comment: Accepted to NAACL 2024 Findings
☆ LFED: A Literary Fiction Evaluation Dataset for Large Language Models
The rapid evolution of large language models (LLMs) has ushered in the need
for comprehensive assessments of their performance across various dimensions.
In this paper, we propose LFED, a Literary Fiction Evaluation Dataset, which
aims to evaluate the capability of LLMs in long-fiction comprehension and
reasoning. We collect 95 works of literary fiction that are either originally written
in Chinese or translated into Chinese, covering a wide range of topics across
several centuries. We define a question taxonomy with 8 question categories to
guide the creation of 1,304 questions. Additionally, we conduct an in-depth
analysis to ascertain how specific attributes of literary fiction (e.g., novel
type, number of characters, year of publication) impact LLM performance in
evaluations. Through a series of experiments with various state-of-the-art
LLMs, we demonstrate that these models face considerable challenges in
effectively addressing questions related to literary fictions, with ChatGPT
reaching only 57.08% under the zero-shot setting. The dataset will be publicly
available at https://github.com/tjunlp-lab/LFED.git
☆ Speaker Verification in Agent-Generated Conversations
The recent success of large language models (LLMs) has attracted widespread
interest in developing role-playing conversational agents personalized to the
characteristics and styles of different speakers to enhance their abilities to
perform both general and special purpose dialogue tasks. However, the ability
to personalize the generated utterances to speakers, whether conducted by
humans or LLMs, has not been well studied. To bridge this gap, our study introduces a
novel evaluation challenge: speaker verification in agent-generated
conversations, which aims to verify whether two sets of utterances originate
from the same speaker. To this end, we assemble a large dataset collection
encompassing thousands of speakers and their utterances. We also develop and
evaluate speaker verification models under various experimental setups. We further
utilize the speaker verification models to evaluate the personalization
abilities of LLM-based role-playing models. Comprehensive experiments suggest
that current role-playing models fail to accurately mimic speakers,
primarily due to their inherent linguistic characteristics.
☆ PL-MTEB: Polish Massive Text Embedding Benchmark
In this paper, we introduce the Polish Massive Text Embedding Benchmark
(PL-MTEB), a comprehensive benchmark for text embeddings in Polish. The PL-MTEB
consists of 28 diverse NLP tasks from 5 task types. We adapted the tasks from
datasets previously used by the Polish NLP community. In addition, we
created a new PLSC (Polish Library of Science Corpus) dataset consisting of
titles and abstracts of scientific publications in Polish, which was used as
the basis for two novel clustering tasks. We evaluated 15 publicly available
models for text embedding, including Polish and multilingual ones, and
collected detailed results for individual tasks and aggregated results for each
task type and the entire benchmark. PL-MTEB comes with open-source code at
https://github.com/rafalposwiata/pl-mteb.
comment: 10 pages, 6 tables, 1 figure
☆ Turkronicles: Diachronic Resources for the Fast Evolving Turkish Language
Over the past century, the Turkish language has undergone substantial
changes, primarily driven by governmental interventions. In this work, our goal
is to investigate the evolution of the Turkish language since the establishment
of Türkiye in 1923. Thus, we first introduce Turkronicles, a diachronic corpus
for Turkish derived from the Official Gazette of Türkiye.
Turkronicles contains 45,375 documents, detailing governmental actions, making
it a pivotal resource for analyzing the linguistic evolution influenced by the
state policies. In addition, we expand an existing diachronic Turkish corpus,
which consists of the records of the Grand National Assembly of Türkiye, by
covering additional years. Next, combining these two diachronic corpora, we
seek answers to two main research questions: How has the Turkish vocabulary
changed since the 1920s? And how have writing conventions changed? Our analysis reveals that
the vocabularies of two different time periods diverge more as the time between
them increases, and newly coined Turkish words take the place of their old
counterparts. We also observe changes in writing conventions. In particular,
the use of the circumflex noticeably decreases, and words ending with the
letters "-b" and "-d" are progressively replaced with "-p" and "-t",
respectively. Overall, this study quantitatively highlights the dramatic
changes in Turkish across various aspects of the language from a diachronic
perspective.
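One simple way to quantify the reported vocabulary divergence (our own
construction, not necessarily the paper's exact measure) is the Jaccard
distance between the vocabularies of two time periods:

```python
from collections import Counter

def vocabulary(documents, min_count=5):
    """Word types appearing at least `min_count` times in a period."""
    counts = Counter(w for doc in documents for w in doc.lower().split())
    return {w for w, c in counts.items() if c >= min_count}

def jaccard_distance(vocab_a, vocab_b):
    """1 - |A intersect B| / |A union B|; grows as two periods diverge."""
    union = vocab_a | vocab_b
    return 1 - len(vocab_a & vocab_b) / len(union) if union else 0.0
```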
☆ StyloAI: Distinguishing AI-Generated Content with Stylometric Analysis
The emergence of large language models (LLMs) capable of generating realistic
texts and images has sparked ethical concerns across various sectors. In
response, researchers in academia and industry are actively exploring methods
to distinguish AI-generated content from human-authored material. However, a
crucial question remains: What are the unique characteristics of AI-generated
text? Addressing this gap, this study proposes StyloAI, a data-driven model
that uses 31 stylometric features to identify AI-generated texts by applying a
Random Forest classifier on two multi-domain datasets. StyloAI achieves
accuracy rates of 81% and 98% on the test set of the AuTextification dataset
and the Education dataset, respectively. This approach surpasses the
performance of existing state-of-the-art models and provides valuable insights
into the differences between AI-generated and human-authored texts.
comment: 25th International Conference on Artificial Intelligence in Education (AIED 2024)
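A hedged sketch of the StyloAI recipe, with three illustrative stylometric
features of our choosing in place of the paper's 31:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def stylometric_features(text):
    """Three illustrative features; the paper uses 31."""
    words = text.split()
    return [
        len(words),                                              # text length
        float(np.mean([len(w) for w in words])) if words else 0.0,  # avg word length
        text.count(",") / max(len(words), 1),                    # comma density
    ]

texts = ["An example sentence a person might write.",
         "Generated text, fluent, structured, and generic."]
labels = [0, 1]  # 0 = human-authored, 1 = AI-generated
X = np.array([stylometric_features(t) for t in texts])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict(X))
```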
☆ Red Teaming Language Models for Contradictory Dialogues
Most language models currently available are prone to self-contradiction
during dialogues. To mitigate this issue, this study explores a novel
contradictory dialogue processing task that aims to detect and modify
contradictory statements in a conversation. This task is inspired by research
on context faithfulness and dialogue comprehension, which have demonstrated
that the detection and understanding of contradictions often necessitate
detailed explanations. We develop a dataset comprising contradictory dialogues,
in which one side of the conversation contradicts itself. Each dialogue is
accompanied by an explanatory label that highlights the location and details of
the contradiction. With this dataset, we present a Red Teaming framework for
contradictory dialogue processing. The framework detects contradictions,
attempts to explain them, and then modifies the contradictory content using
the explanation. Our experiments demonstrate that the framework improves the
ability to detect contradictory dialogues and provides valid explanations.
Additionally, it showcases distinct capabilities for modifying such dialogues.
Our study highlights the importance of the logical inconsistency problem in
conversational AI.
comment: 18 pages, 5 figures
☆ Distilling Implicit Multimodal Knowledge into LLMs for Zero-Resource Dialogue Generation
Integrating multimodal knowledge into large language models (LLMs) represents
a significant advancement in dialogue generation capabilities. However, the
effective incorporation of such knowledge in zero-resource scenarios remains a
substantial challenge due to the scarcity of diverse, high-quality dialogue
datasets. To address this, we propose the Visual Implicit Knowledge
Distillation Framework (VIKDF), an innovative approach aimed at enhancing LLMs
for enriched dialogue generation in zero-resource contexts by leveraging
implicit multimodal knowledge. VIKDF comprises two main stages: knowledge
distillation, using an Implicit Query Transformer to extract and encode visual
implicit knowledge from image-text pairs into knowledge vectors; and knowledge
integration, employing a novel Bidirectional Variational Information Fusion
technique to seamlessly integrate these distilled vectors into LLMs. This
enables the LLMs to generate dialogues that are not only coherent and engaging
but also exhibit a deep understanding of the context through implicit
multimodal cues, effectively overcoming the limitations of zero-resource
scenarios. Our extensive experimentation across two dialogue datasets shows
that VIKDF outperforms existing state-of-the-art models in generating
high-quality dialogues. The code will be publicly available following
acceptance.
comment: Under Review
☆ MarkLLM: An Open-Source Toolkit for LLM Watermarking
Leyi Pan, Aiwei Liu, Zhiwei He, Zitian Gao, Xuandong Zhao, Yijian Lu, Binglin Zhou, Shuliang Liu, Xuming Hu, Lijie Wen, Irwin King
LLM watermarking, which embeds imperceptible yet algorithmically detectable
signals in model outputs to identify LLM-generated text, has become crucial in
mitigating the potential misuse of large language models. However, the
abundance of LLM watermarking algorithms, their intricate mechanisms, and the
complex evaluation procedures and perspectives pose challenges for researchers
and the community to easily experiment with, understand, and assess the latest
advancements. To address these issues, we introduce MarkLLM, an open-source
toolkit for LLM watermarking. MarkLLM offers a unified and extensible framework
for implementing LLM watermarking algorithms, while providing user-friendly
interfaces to ensure ease of access. Furthermore, it enhances understanding by
supporting automatic visualization of the underlying mechanisms of these
algorithms. For evaluation, MarkLLM offers a comprehensive suite of 12 tools
spanning three perspectives, along with two types of automated evaluation
pipelines. Through MarkLLM, we aim to support researchers while improving the
comprehension and involvement of the general public in LLM watermarking
technology, fostering consensus and driving further advancements in research
and application. Our code is available at https://github.com/THU-BPM/MarkLLM.
comment: 16 pages, 5 figures, 6 tables
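As background, here is a minimal sketch (ours, not MarkLLM's API) of the
detection statistic behind the green/red-list watermark family that such
toolkits implement: count how many tokens fall into a pseudo-random "green"
list and compute a one-proportion z-score.

```python
import hashlib
import math

def in_green_list(prev_token, token, gamma=0.5):
    """Pseudo-randomly assign `token` to the green list, seeded by the
    previous token, as in green/red-list watermarking schemes."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] / 256.0 < gamma

def z_score(tokens, gamma=0.5):
    """One-proportion z-test: watermarked text has unusually many greens."""
    n = len(tokens) - 1
    greens = sum(in_green_list(p, t, gamma)
                 for p, t in zip(tokens, tokens[1:]))
    return (greens - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

print(z_score("the watermark shifts sampling toward green tokens".split()))
```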
☆ SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation
Large language models (LLMs) are versatile and can address many tasks, but
for computational efficiency, it is often desirable to distill their
capabilities into smaller student models. One way to do this for classification
tasks is via dataset synthesis, which can be accomplished by generating
examples of each label from the LLM. Prior approaches to synthesis use few-shot
prompting, which relies on the LLM's parametric knowledge to generate usable
examples. However, this leads to issues of repetition, bias towards popular
entities, and stylistic differences from human text. In this work, we propose
Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval
augmentation to introduce variety into the dataset synthesis process: as
retrieved passages vary, the LLM is "seeded" with different content to generate
its examples. We empirically study the synthesis of six datasets, covering
topic classification, sentiment analysis, tone detection, and humor, requiring
complex synthesis strategies. We find SynthesizRR greatly improves lexical and
semantic diversity, similarity to human-written text, and distillation
performance, when compared to standard 32-shot prompting and six baseline
approaches.
☆ Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models ACL 2024
Recent advances in large language models (LLMs) have promoted generative
error correction (GER) for automatic speech recognition (ASR), which aims to
predict the ground-truth transcription from the decoded N-best hypotheses.
Thanks to the strong language generation ability of LLMs and rich information
in the N-best list, GER shows great effectiveness in enhancing ASR results.
However, it still suffers from two limitations: 1) LLMs are unaware of the
source speech during GER, which may lead to results that are grammatically
correct but unfaithful to the source speech content; 2) N-best hypotheses
usually vary in only a few tokens, making it redundant to send all of them for
GER, which can confuse the LLM about which tokens to focus on and thus lead to
increased miscorrection. In this paper, we propose ClozeGER, a new paradigm for
ASR generative error correction. First, we introduce a multimodal LLM (i.e.,
SpeechGPT) to receive source speech as extra input to improve the fidelity of
correction output. Then, we reformat GER as a cloze test with logits
calibration to remove the input information redundancy and simplify GER with
clear instructions. Experiments show that ClozeGER achieves a new breakthrough
over vanilla GER on 9 popular ASR datasets.
comment: 14 pages, Accepted by ACL 2024
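An illustrative sketch of the cloze reformatting idea, under our simplifying
assumption that the N-best hypotheses are already token-aligned (a real system
would need an alignment step):

```python
def to_cloze(hypotheses):
    """Collapse aligned N-best ASR hypotheses into a cloze template:
    agreeing positions stay fixed, disagreeing ones become blanks."""
    assert len({len(h) for h in hypotheses}) == 1, "aligned hypotheses assumed"
    template, options = [], []
    for column in zip(*hypotheses):
        if len(set(column)) == 1:
            template.append(column[0])
        else:
            template.append(f"[BLANK{len(options)}]")
            options.append(sorted(set(column)))
    return " ".join(template), options

nbest = [["i", "want", "to", "sea", "you"],
         ["i", "want", "to", "see", "you"]]
print(to_cloze(nbest))  # ('i want to [BLANK0] you', [['sea', 'see']])
```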
☆ Natural Language Can Help Bridge the Sim2Real Gap
The main challenge in learning image-conditioned robotic policies is
acquiring a visual representation conducive to low-level control. Due to the
high dimensionality of the image space, learning a good visual representation
requires a considerable amount of visual data. However, when learning in the
real world, data is expensive. Sim2Real is a promising paradigm for overcoming
data scarcity in the real-world target domain by using a simulator to collect
large amounts of cheap data closely related to the target task. However, it is
difficult to transfer an image-conditioned policy from sim to real when the
domains are very visually dissimilar. To bridge the sim2real visual gap, we
propose using natural language descriptions of images as a unifying signal
across domains that captures the underlying task-relevant semantics. Our key
insight is that if two image observations from different domains are labeled
with similar language, the policy should predict similar action distributions
for both images. We demonstrate that training the image encoder to predict the
language description or the distance between descriptions of a sim or real
image serves as a useful, data-efficient pretraining step that helps learn a
domain-invariant image representation. We can then use this image encoder as
the backbone of an IL policy trained simultaneously on a large number of
simulated demonstrations and a handful of real ones. Our approach outperforms widely
used prior sim2real methods and strong vision-language pretraining baselines
like CLIP and R3M by 25 to 40%.
comment: To appear in RSS 2024
☆ Zero-Shot Hierarchical Classification on the Common Procurement Vocabulary Taxonomy
Classifying public tenders is a useful task for both companies that are
invited to participate and for inspecting fraudulent activities. To facilitate
the task for both participants and public administrations, the European Union
presented a common taxonomy (Common Procurement Vocabulary, CPV), which
is mandatory for tenders of a certain importance; however, the contracts in
which a CPV label is mandatory are a minority of all public administration
activities. Classifying over a real-world taxonomy introduces
some difficulties that cannot be ignored. First of all, some fine-grained
classes have an insufficient (if any) number of observations in the training
set, while other classes are far more frequent (even thousands of times) than
the average. To overcome those difficulties, we present a zero-shot approach,
based on a pre-trained language model that relies only on label description and
respects the label taxonomy. To train our proposed model, we used industrial
data from contrattipubblici.org, a service by SpazioDati s.r.l.
(https://spaziodati.eu) that collects public contracts stipulated in Italy over
the last 25 years. Results show that the proposed model achieves better
performance in classifying low-frequency classes compared to three different
baselines, and is also able to predict never-seen classes.
comment: Full-length version of the short paper accepted at COMPSAC 2024
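A hedged sketch of the zero-shot scheme: embed the tender text and every label
description with a shared encoder and pick the closest label. The encoder here
is a hash-seeded stand-in, and the CPV descriptions are invented for
illustration:

```python
import numpy as np

def embed(text, dim=64):
    """Hash-seeded stand-in for a pretrained sentence encoder."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def classify(tender_text, label_descriptions):
    """Pick the label whose description is most similar to the tender;
    works even for labels never seen in training."""
    doc = embed(tender_text)
    return max(label_descriptions,
               key=lambda code: float(embed(label_descriptions[code]) @ doc))

labels = {"45000000": "Construction work",
          "72000000": "IT services: consulting, software development"}
print(classify("Tender for building renovation works", labels))
```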
☆ FinTextQA: A Dataset for Long-form Financial Question Answering
Accurate evaluation of financial question answering (QA) systems necessitates
a comprehensive dataset encompassing diverse question types and contexts.
However, current financial QA datasets lack scope diversity and question
complexity. This work introduces FinTextQA, a novel dataset for long-form
question answering (LFQA) in finance. FinTextQA comprises 1,262 high-quality,
source-attributed QA pairs extracted and selected from finance textbooks and
government agency websites. Moreover, we developed a Retrieval-Augmented
Generation (RAG)-based LFQA system, comprising an embedder, retriever,
reranker, and generator. A multi-faceted evaluation approach, including human
ranking, automatic metrics, and GPT-4 scoring, was employed to benchmark the
performance of different LFQA system configurations under heightened noisy
conditions. The results indicate that: (1) Among all compared generators,
Baichuan2-7B competes closely with GPT-3.5-turbo in accuracy score; (2) The
most effective system configuration on our dataset involved setting the
embedder, retriever, reranker, and generator as Ada2, Automated Merged
Retrieval, Bge-Reranker-Base, and Baichuan2-7B, respectively; (3) models are
less susceptible to noise once the context length reaches a specific threshold.
☆ Mitigating Text Toxicity with Counterfactual Generation
Milan Bhan, Jean-Noel Vittaut, Nina Achache, Victor Legrand, Nicolas Chesneau, Annabelle Blangero, Juliette Murris, Marie-Jeanne Lesot
Toxicity mitigation consists in rephrasing text in order to remove offensive
or harmful meaning. Neural natural language processing (NLP) models have been
widely used to target and mitigate textual toxicity. However, existing methods
fail to detoxify text while preserving the initial non-toxic meaning at the
same time. In this work, we propose to apply counterfactual generation methods
from the eXplainable AI (XAI) field to target and mitigate textual toxicity. In
particular, we perform text detoxification by applying local feature importance
and counterfactual generation methods to a toxicity classifier distinguishing
between toxic and non-toxic texts. We carry out text detoxification through
counterfactual generation on three datasets and compare our approach to three
competitors. Automatic and human evaluations show that recently developed NLP
counterfactual generators can mitigate toxicity accurately while better
preserving the meaning of the initial text as compared to classical
detoxification methods. Finally, we take a step back from using automated
detoxification tools, and discuss how to manage the polysemous nature of
toxicity and the risk of malicious use of detoxification tools. This work is
the first to bridge the gap between counterfactual generation and text
detoxification and paves the way towards more practical application of XAI
methods.
☆ SciQAG: A Framework for Auto-Generated Scientific Question Answering Dataset with Fine-grained Evaluation
Yuwei Wan, Aswathy Ajith, Yixuan Liu, Ke Lu, Clara Grazian, Bram Hoex, Wenjie Zhang, Chunyu Kit, Tong Xie, Ian Foster
The use of question-answer (QA) pairs for training and evaluating large
language models (LLMs) has attracted considerable attention. Yet few available
QA datasets are based on knowledge from the scientific literature. Here we
bridge this gap by presenting Automatic Generation of Scientific Question
Answers (SciQAG), a framework for automatic generation and evaluation of
scientific QA pairs sourced from published scientific literature. We fine-tune
an open-source LLM to generate 960,000 scientific QA pairs from full-text
scientific papers and propose a five-dimensional metric to evaluate the quality
of the generated QA pairs. We show via LLM-based evaluation that the generated
QA pairs consistently achieve an average score of 2.5 out of 3 across five
dimensions, indicating that our framework can distill key knowledge from papers
into high-quality QA pairs at scale. We make the dataset, models, and
evaluation codes publicly available.
☆ DEBATE: Devil's Advocate-Based Assessment and Text Evaluation
As natural language generation (NLG) models have become prevalent,
systematically assessing the quality of machine-generated texts has become
increasingly important. Recent studies introduce LLM-based evaluators that
operate as reference-free metrics, demonstrating their capability to adeptly
handle novel tasks. However, these models generally rely on a single-agent
approach, which, we argue, introduces an inherent limit to their performance.
This is because there exist biases in LLM agents' responses, including
preferences for certain text structures or content. In this work, we propose
DEBATE, an NLG evaluation framework based on a multi-agent scoring system
augmented with the concept of a Devil's Advocate. Within the framework, one
agent is instructed to criticize other agents' arguments, potentially resolving
the bias in LLM agents' answers. DEBATE substantially outperforms the previous
state-of-the-art methods in two meta-evaluation benchmarks in NLG evaluation,
SummEval and TopicalChat. We also show that the extensiveness of debates among
agents and the persona of an agent can influence the performance of evaluators.
☆ TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data
Transliterating related languages that use different scripts into a common
script shows effectiveness in improving crosslingual transfer in downstream
tasks. However, this methodology often makes pretraining a model from scratch
unavoidable, as transliteration brings about new subwords not covered in
existing multilingual pretrained language models (mPLMs). This is not desired
because it takes a lot of computation budget for pretraining. A more promising
way is to make full use of available mPLMs. To this end, this paper proposes a
simple but effective framework: Transliterate-Merge-Initialize (TransMI), which
can create a strong baseline well-suited for data transliterated into a
common script by exploiting an mPLM and its accompanying tokenizer. TransMI has
three stages: (a) transliterate the vocabulary of an mPLM into a common script;
(b) merge the new vocabulary with the original vocabulary; and (c) initialize
the embeddings of the new subwords. We applied TransMI to three recent strong
mPLMs, and our experiments demonstrate that TransMI not only preserves their
ability to handle non-transliterated data, but also enables the models to
effectively process transliterated data: the results show a consistent
improvement of 3% to 34%, varying across different models and tasks. We make
our code and models publicly available at https://github.com/cisnlp/TransMI.
comment: preprint
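A minimal sketch of the three TransMI stages under our simplifying assumptions:
`transliterate` is a toy stand-in, and each new subword's embedding is copied
from the subword it was transliterated from (one plausible initialization, not
necessarily the paper's exact scheme):

```python
import numpy as np

def transliterate(subword):
    """Toy stand-in; real TransMI uses a proper transliteration tool."""
    return subword.lower()

def transmi(vocab_embeddings):
    """(a) transliterate each subword, (b) merge new subwords into the
    vocabulary, (c) initialize each new embedding from its source subword."""
    merged = dict(vocab_embeddings)
    for subword, emb in vocab_embeddings.items():
        t = transliterate(subword)
        if t not in merged:
            merged[t] = emb.copy()  # assumed initialization scheme
    return merged

vocab = {"NLP": np.ones(4)}
print(sorted(transmi(vocab)))  # ['NLP', 'nlp']: new subword, source embedding
```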
☆ "Hunt Takes Hare": Theming Games Through Game-Word Vector Translation
A game's theme is an important part of its design -- it conveys narrative
information, rhetorical messages, helps the player intuit strategies, aids in
tutorialisation and more. Thematic elements of games are notoriously difficult
for AI systems to understand and manipulate, however, and often rely on large
amounts of hand-written interpretations and knowledge. In this paper we present
a technique which connects game embeddings, a recent method for modelling game
dynamics from log data, and word embeddings, which model semantic information
about language. We explain two different approaches for using game embeddings
in this way, and show evidence that game embeddings enhance the linguistic
translations of game concepts from one theme to another, opening up exciting
new possibilities for reasoning about the thematic elements of games in the
future.
comment: 7 pages, PCG Workshop at FDG 2024
☆ IGOT: Information Gain Optimized Tokenizer on Domain Adaptive Pretraining
Pretrained Large Language Models (LLMs) such as ChatGPT and Claude have
demonstrated strong capabilities in various fields of natural language
generation. However, there are still many problems when using LLMs in
specialized domains. When using generative AI to process
downstream tasks, a common approach is to add new knowledge (e.g., private
domain knowledge, cutting-edge information) to a pretrained model through
continued training or fine-tuning. However, whether there is a universal
paradigm for domain-adaptation training is still an open question. In this
article, we propose the Information Gain Optimized Tokenizer (IGOT), which
analyzes the special token set of downstream tasks, constructs a new subset
using a heuristic function $\phi$ based on each special token and its
information gain, builds a new domain-specific tokenizer, and continues
pretraining on the downstream task data. We explore the many positive effects of this method's
customized tokenizer on domain-adaptive pretraining and verify that it can
perform better than the ordinary approach of just collecting data and
fine-tuning. In our experiments, the continued pretraining process of IGOT
with LLaMA-7B achieved an 11.9% token saving, a 12.2% training-time saving, and
a 5.8% saving in maximum GPU VRAM usage; combined with the T5 model, we can
even reach a 31.5% training-time saving, making porting general generative AI
to specific domains more effective than before. On domain-specific tasks,
supervised $IGOT_\tau$ performs well at reducing both the convergence radius
and the convergence point during continued pretraining.
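A hedged sketch of the tokenizer-augmentation idea: score candidate domain
strings by frequency times the number of base-tokenizer pieces they would save.
This scoring is our stand-in for the paper's heuristic $\phi$ and its
information-gain term, not the authors' exact definition:

```python
from collections import Counter

def candidate_tokens(corpus_words, base_tokenize, top_n=100):
    """Rank candidate domain tokens by frequency x subword pieces saved."""
    freq = Counter(corpus_words)
    scores = {w: c * (len(base_tokenize(w)) - 1) for w, c in freq.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy base tokenizer that splits words into 3-character pieces.
base_tokenize = lambda w: [w[i:i + 3] for i in range(0, len(w), 3)]
corpus = "pretraining pretraining tokenizer tokenizer tokenizer gain".split()
print(candidate_tokens(corpus, base_tokenize, top_n=2))
```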
☆ On the relevance of pre-neural approaches in natural language processing pedagogy ACL 2024
While neural approaches using deep learning are the state-of-the-art for
natural language processing (NLP) today, pre-neural algorithms and approaches
still find a place in NLP textbooks and courses of recent years. In this paper,
we compare two introductory NLP courses taught in Australia and India, and
examine how Transformer and pre-neural approaches are balanced within the
lecture plan and assessments of the courses. We also draw parallels with the
objects-first and objects-later debate in CS1 education. We observe that
pre-neural approaches add value to student learning by building an intuitive
understanding of NLP problems, potential solutions and even Transformer-based
models themselves. Despite pre-neural approaches not being state-of-the-art,
the paper makes a case for their inclusion in NLP courses today.
comment: Under review at Teaching NLP workshop at ACL 2024; 8 pages
☆ Enhancing Semantics in Multimodal Chain of Thought via Soft Negative Sampling LREC
Chain of thought (CoT) has proven useful for problems requiring complex
reasoning. Many of these problems are both textual and multimodal. Given the
inputs in different modalities, a model generates a rationale and then uses it
to answer a question. Because of the hallucination issue, generated rationales
with high textual quality but illogical semantics (soft negatives) do not
always help improve answer accuracy. This study proposes a rationale generation
method using soft negative sampling (SNSE-CoT) to mitigate hallucinations in
multimodal CoT. Five methods were applied to generate soft negative samples
that shared highly similar text but had different semantics from the original.
Bidirectional margin loss (BML) was applied to introduce them into the
traditional contrastive learning framework that involves only positive and
negative samples. Extensive experiments on the ScienceQA dataset demonstrated
the effectiveness of the proposed method. Code and data are released at
https://github.com/zgMin/SNSE-CoT.
comment: Accepted by LREC-COLING 2024
☆ Chameleon: Mixed-Modal Early-Fusion Foundation Models
We present Chameleon, a family of early-fusion token-based mixed-modal models
capable of understanding and generating images and text in any arbitrary
sequence. We outline a stable training approach from inception, an alignment
recipe, and an architectural parameterization tailored for the early-fusion,
token-based, mixed-modal setting. The models are evaluated on a comprehensive
range of tasks, including visual question answering, image captioning, text
generation, image generation, and long-form mixed modal generation. Chameleon
demonstrates broad and general capabilities, including state-of-the-art
performance on image captioning tasks; it outperforms Llama-2 on text-only
tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro,
and it performs non-trivial image generation, all in a single model. It also
matches or exceeds the performance of much larger models, including Gemini Pro and
GPT-4V, according to human judgments on a new long-form mixed-modal generation
evaluation, where either the prompt or outputs contain mixed sequences of both
images and text. Chameleon marks a significant step forward in a unified
modeling of full multimodal documents.
☆ MediSyn: Text-Guided Diffusion Models for Broad Medical 2D and 3D Image Synthesis
Diffusion models have recently gained significant traction due to their
ability to generate high-fidelity and diverse images and videos conditioned on
text prompts. In medicine, this application promises to address the critical
challenge of data scarcity, a consequence of barriers in data sharing,
stringent patient privacy regulations, and disparities in patient population
and demographics. By generating realistic and varying medical 2D and 3D images,
these models offer a rich, privacy-respecting resource for algorithmic training
and research. To this end, we introduce MediSyn, a pair of instruction-tuned
text-guided latent diffusion models with the ability to generate high-fidelity
and diverse medical 2D and 3D images across specialties and modalities. Through
established metrics, we show significant improvement in broad medical image and
video synthesis guided by text prompts.
☆ SecureLLM: Using Compositionality to Build Provably Secure Language Models for Private, Sensitive, and Secret Data
Abdulrahman Alabdulakreem, Christian M Arnold, Yerim Lee, Pieter M Feenstra, Boris Katz, Andrei Barbu
Traditional security mechanisms isolate resources from users who should not
access them. We reflect the compositional nature of such security mechanisms
back into the structure of LLMs to build a provably secure LLM, which we term
SecureLLM. Other approaches to LLM safety attempt to protect against bad actors
or bad outcomes, but can only do so to an extent, making them inappropriate for
sensitive data. SecureLLM blends access security with fine-tuning methods. Each
data silo has associated with it a separate fine-tuning and a user has access
only to the collection of fine-tunings that they have permission for. The model
must then perform on compositional tasks at the intersection of those data
silos with the combination of those individual fine-tunings. While applicable
to any task like document QA or making API calls, in this work we concern
ourselves with models that learn the layouts of new SQL databases to provide
natural-language-to-SQL translation capabilities. Existing fine-tuning
composition methods fail in this challenging environment, as they are not
well-equipped for handling compositional tasks. Compositionality remains a
challenge for LLMs. We contribute both a difficult new compositional
natural-language-to-SQL translation task and a new perspective on LLM security
that allows models to be deployed to secure environments today.
☆ Many-Shot In-Context Learning in Multimodal Foundation Models
Large language models are well-known to be effective at few-shot in-context
learning (ICL). Recent advancements in multimodal foundation models have
enabled unprecedentedly long context windows, presenting an opportunity to
explore their capability to perform ICL with many more demonstrating examples.
In this work, we evaluate the performance of multimodal foundation models
scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro
across 10 datasets spanning multiple domains (natural imagery, medical imagery,
remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and
fine-grained classification). We observe that many-shot ICL, including up to
almost 2,000 multimodal demonstrating examples, leads to substantial
improvements compared to few-shot (<100 examples) ICL across all of the
datasets. Further, Gemini 1.5 Pro performance continues to improve log-linearly
up to the maximum number of tested examples on many datasets. Given the high
inference costs associated with the long prompts required for many-shot ICL, we
also explore the impact of batching multiple queries in a single API call. We
show that batching up to 50 queries can lead to performance improvements under
zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on
multiple datasets, while drastically reducing per-query cost and latency.
Finally, we measure ICL data efficiency of the models, or the rate at which the
models learn from more demonstrating examples. We find that while GPT-4o and
Gemini 1.5 Pro achieve similar zero-shot performance across the datasets,
Gemini 1.5 Pro exhibits higher ICL data efficiency than GPT-4o on most
datasets. Our results suggest that many-shot ICL could enable users to
efficiently adapt multimodal foundation models to new applications and domains.
Our codebase is publicly available at
https://github.com/stanfordmlgroup/ManyICL .
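A minimal sketch (our format, not the paper's) of batching several queries into
a single many-shot prompt, so the long demonstration context is paid for once
per API call rather than once per query:

```python
def batched_prompt(demos, queries):
    """Pack many-shot demonstrations once, then append several queries."""
    lines = ["You will see labeled examples, then several queries to answer."]
    for x, y in demos:                    # many-shot demonstrations
        lines.append(f"Input: {x}\nLabel: {y}")
    for i, q in enumerate(queries, 1):    # batched queries (paper tests up to 50)
        lines.append(f"Query {i}: {q}")
    lines.append("Answer every query in the form 'Query <i>: <label>'.")
    return "\n".join(lines)

demos = [("cloudy sky over fields", "natural imagery"),
         ("chest X-ray, frontal view", "medical imagery")]
print(batched_prompt(demos, ["satellite view of a river delta"]))
```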
☆ Optimization Techniques for Sentiment Analysis Based on LLM (GPT-3)
With the rapid development of natural language processing (NLP) technology,
large-scale pre-trained language models such as GPT-3 have become a popular
research object in the NLP field. This paper aims to explore sentiment analysis
optimization techniques based on large pre-trained language models such as
GPT-3 to improve model performance and effectiveness and further promote the
development of natural language processing (NLP). By introducing the importance
of sentiment analysis and the limitations of traditional methods, this paper
introduces GPT-3 and fine-tuning techniques and explains their applications in
sentiment analysis in detail. The experimental results show that fine-tuning
can optimize the GPT-3 model and achieve good performance on the sentiment
analysis task. This study provides an important reference for
future sentiment analysis using large-scale language models.
☆ Unsupervised Extractive Dialogue Summarization in Hyperdimensional Space ICASSP 2024
We present HyperSum, an extractive summarization framework that captures both
the efficiency of traditional lexical summarization and the accuracy of
contemporary neural approaches. HyperSum exploits the pseudo-orthogonality that
emerges when randomly initializing vectors at extremely high dimensions
("blessing of dimensionality") to construct representative and efficient
sentence embeddings. Simply clustering the obtained embeddings and extracting
their medoids yields competitive summaries. HyperSum often outperforms
state-of-the-art summarizers -- in terms of both summary accuracy and
faithfulness -- while being 10 to 100 times faster. We open-source HyperSum as
a strong baseline for unsupervised extractive summarization.
comment: ICASSP 2024
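A rough sketch of the hyperdimensional recipe as we read it: random +/-1 token
hypervectors (near-orthogonal at high dimension), sentences bundled by
summation, and summaries extracted as cluster medoids. Details such as the
clustering algorithm are our assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

DIM = 10_000                  # "blessing of dimensionality"
rng = np.random.default_rng(0)
_token_vecs = {}

def token_vec(token):
    """Random +/-1 hypervector per token; near-orthogonal at high DIM."""
    if token not in _token_vecs:
        _token_vecs[token] = rng.choice([-1.0, 1.0], size=DIM)
    return _token_vecs[token]

def embed_sentence(sentence):
    """Bundle token hypervectors by summation."""
    return np.sum([token_vec(t) for t in sentence.split()], axis=0)

def hypersum(sentences, k=2):
    """Cluster sentence embeddings and extract each cluster's medoid."""
    embs = np.stack([embed_sentence(s) for s in sentences])
    cluster = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embs)
    picks = []
    for c in range(k):
        idx = np.where(cluster == c)[0]
        dists = ((embs[idx][:, None] - embs[idx][None, :]) ** 2).sum(-1)
        picks.append(idx[int(dists.sum(axis=1).argmin())])  # medoid
    return [sentences[i] for i in sorted(picks)]

print(hypersum(["the cat sat down", "a cat sat on the mat",
                "stocks fell today", "markets fell hard"], k=2))
```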
☆ Many Hands Make Light Work: Task-Oriented Dialogue System with Module-Based Mixture-of-Experts
Task-oriented dialogue systems are broadly used in virtual assistants and
other automated services, providing interfaces between users and machines to
facilitate specific tasks. Nowadays, task-oriented dialogue systems have
greatly benefited from pre-trained language models (PLMs). However, their
task-solving performance is constrained by the inherent capacities of PLMs, and
scaling these models is expensive and complex as the model size becomes larger.
To address these challenges, we propose Soft Mixture-of-Expert Task-Oriented
Dialogue system (SMETOD) which leverages an ensemble of Mixture-of-Experts
(MoEs) to excel at subproblems and generate specialized outputs for
task-oriented dialogues. SMETOD also scales up a task-oriented dialogue system
with simplicity and flexibility while maintaining inference efficiency. We
extensively evaluate our model on three benchmark functionalities: intent
prediction, dialogue state tracking, and dialogue response generation.
Experimental results demonstrate that SMETOD achieves state-of-the-art
performance on most evaluated metrics. Moreover, comparisons against existing
strong baselines show that SMETOD has a great advantage in inference cost and
problem-solving correctness.
☆ An Analysis of Sentential Neighbors in Implicit Discourse Relation Prediction
Discourse relation classification is an especially difficult task without
explicit context markers (Prasad et al., 2008). Current approaches to implicit
relation prediction rely solely on the two neighboring sentences being
targeted, ignoring the broader context of their surrounding environments
(Atwell et al., 2021). In this research, we propose three new methods for
incorporating context into the task of sentence relation prediction: (1)
Direct Neighbors (DNs), (2) Expanded Window Neighbors (EWNs), and (3)
Part-Smart Random Neighbors (PSRNs). Our findings indicate that the inclusion
of context beyond one discourse unit is harmful in the task of discourse
relation classification.
♻ ☆ OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs
Mihai Masala, Denis C. Ilie-Ablachim, Dragos Corlatescu, Miruna Zavelca, Marius Leordeanu, Horia Velicu, Marius Popescu, Mihai Dascalu, Traian Rebedea
In recent years, Large Language Models (LLMs) have achieved almost human-like
performance on various tasks. While some LLMs have been trained on multilingual
data, most of the training data is in English. Hence, their performance in
English greatly exceeds their performance in other languages. This document
presents our approach to training and evaluating the first foundational and
chat LLM specialized for Romanian.
♻ ☆ DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model
characterized by economical training and efficient inference. It comprises 236B
total parameters, of which 21B are activated for each token, and supports a
context length of 128K tokens. DeepSeek-V2 adopts innovative architectures
including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees
efficient inference through significantly compressing the Key-Value (KV) cache
into a latent vector, while DeepSeekMoE enables training strong models at an
economical cost through sparse computation. Compared with DeepSeek 67B,
DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves
42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum
generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality
and multi-source corpus consisting of 8.1T tokens, and further perform
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock
its potential. Evaluation results show that, even with only 21B activated
parameters, DeepSeek-V2 and its chat versions still achieve top-tier
performance among open-source models.
♻ ☆ LLM-Assisted Rule Based Machine Translation for Low/No-Resource Languages
We propose a new paradigm for machine translation that is particularly useful
for no-resource languages (those without any publicly available bilingual or
monolingual corpora): LLM-RBMT (LLM-Assisted Rule Based Machine Translation).
Using the LLM-RBMT paradigm, we design the first language
education/revitalization-oriented machine translator for Owens Valley Paiute
(OVP), a critically endangered Indigenous American language for which there is
virtually no publicly available data. We present a detailed evaluation of the
translator's components: a rule-based sentence builder, an OVP to English
translator, and an English to OVP translator. We also discuss the potential of
the paradigm, its limitations, and the many avenues for future research that it
opens up.
♻ ☆ A Modular Approach for Multimodal Summarization of TV Shows
In this paper we address the task of summarizing television shows, which
touches key areas in AI research: complex reasoning, multiple modalities, and
long narratives. We present a modular approach where separate components
perform specialized sub-tasks, which we argue affords greater flexibility
compared to end-to-end methods. Our modules involve detecting scene boundaries,
reordering scenes so as to minimize the number of cuts between different
events, converting visual information to text, summarizing the dialogue in each
scene, and fusing the scene summaries into a final summary for the entire
episode. We also present a new metric, PREFS (Precision and Recall Evaluation
of Summary FactS), to measure both precision and recall of generated summaries,
which we decompose into atomic facts. Tested on the recently released
SummScreen3D dataset Papalampidi and Lapata (2023), our method produces higher
quality summaries than comparison models, as measured with ROUGE and our new
fact-based metric.
♻ ☆ Building Knowledge-Grounded Dialogue Systems with Graph-Based Semantic Modeling
The knowledge-grounded dialogue task aims to generate responses that convey
information from given knowledge documents. However, it is a challenge for
current sequence-based models to acquire knowledge from complex documents and
integrate it to generate correct responses without the aid of an explicit
semantic structure. To address these issues, we propose a novel graph
structure, Grounded Graph ($G^2$), that models the semantic structure of both
dialogue and knowledge to facilitate knowledge selection and integration for
knowledge-grounded dialogue generation. We also propose a Grounded Graph Aware
Transformer ($G^2AT$) model that fuses multi-form knowledge (both sequential
and graphic) to enhance knowledge-grounded response generation. Our
experimental results show that our proposed model outperforms the previous
state-of-the-art methods with more than 10\% gains in response generation and
nearly 20\% improvement in factual consistency. Furthermore, our model shows
good generalization ability and robustness. By incorporating semantic structures as
prior knowledge in deep neural networks, our model provides an effective way to
aid language generation.
♻ ☆ TRABSA: Interpretable Sentiment Analysis of Tweets using Attention-based BiLSTM and Twitter-RoBERTa
Sentiment analysis is crucial for understanding public opinion and consumer
behavior. Existing models face challenges with linguistic diversity,
generalizability, and explainability. We propose TRABSA, a hybrid framework
integrating transformer-based architectures, attention mechanisms, and BiLSTM
networks to address these challenges. Leveraging a RoBERTa model trained on
124M tweets, we bridge gaps in sentiment analysis benchmarks, ensuring
state-of-the-art accuracy.
Augmenting datasets with tweets from 32 countries and US states, we compare six
word-embedding techniques and three lexicon-based labeling techniques,
selecting the best for optimal sentiment analysis. TRABSA outperforms
traditional ML and deep learning models with 94% accuracy and significant
precision, recall, and F1-score gains. Evaluation across diverse datasets
demonstrates consistent superiority and generalizability. SHAP and LIME
analyses enhance interpretability, improving confidence in predictions. Our
study facilitates pandemic resource management, aiding resource planning,
policy formation, and vaccination tactics.
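Following the component list in the abstract, here is a hedged PyTorch sketch of a Twitter-RoBERTa encoder feeding a BiLSTM with additive attention pooling and a softmax classifier. This is an illustration of the stack, not the authors' code; the checkpoint name and all sizes are assumptions.

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class TRABSAStyleClassifier(nn.Module):
        """Sketch: Twitter-RoBERTa -> BiLSTM -> additive attention -> classifier."""

        def __init__(self, backbone: str = "cardiffnlp/twitter-roberta-base",
                     n_classes: int = 3):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(backbone)
            hidden = self.encoder.config.hidden_size
            self.bilstm = nn.LSTM(hidden, 128, batch_first=True, bidirectional=True)
            self.attn = nn.Linear(256, 1)           # per-token attention scorer
            self.classifier = nn.Linear(256, n_classes)

        def forward(self, input_ids, attention_mask):
            states = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
            states, _ = self.bilstm(states)          # contextualize with BiLSTM
            scores = self.attn(states).masked_fill(attention_mask.unsqueeze(-1) == 0, -1e9)
            weights = torch.softmax(scores, dim=1)   # attention over tokens
            pooled = (weights * states).sum(dim=1)   # weighted sum -> sentence vector
            return self.classifier(pooled)

    tok = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base")
    batch = tok(["great service!", "worst day ever"], padding=True, return_tensors="pt")
    logits = TRABSAStyleClassifier()(batch["input_ids"], batch["attention_mask"])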
♻ ☆ Benchmarking Retrieval-Augmented Large Language Models in Biomedical NLP: Application, Robustness, and Self-Awareness
Large language models (LLMs) have demonstrated remarkable capabilities in
various biomedical natural language processing (NLP) tasks, leveraging
demonstrations within the input context to adapt to new tasks. However, LLMs
are sensitive to the selection of demonstrations. To address the hallucination
issue inherent in LLMs, retrieval-augmented LLMs (RALs) offer a solution by
retrieving pertinent information from an established database. Nonetheless,
existing work lacks rigorous evaluation of the impact of retrieval-augmented
LLMs on different biomedical NLP tasks, making it difficult to ascertain the
capabilities of RALs within the biomedical domain. Moreover, RAL outputs are
affected by retrieved knowledge that may be unlabeled, counterfactual, or
diverse; such knowledge is common in the real world but remains understudied
in the biomedical domain. Finally, exploring self-awareness is also crucial
for RAL systems. In this paper, we therefore systematically investigate the
impact of RALs
on 5 different biomedical tasks (triple extraction, link prediction,
classification, question answering, and natural language inference). We analyze
the performance of RALs in four fundamental abilities, including unlabeled
robustness, counterfactual robustness, diverse robustness, and negative
awareness. To this end, we propose an evaluation framework to assess the RALs'
performance on different biomedical NLP tasks and establish four different
testbeds based on the aforementioned fundamental abilities. Then, we evaluate 3
representative LLMs with 3 different retrievers on 5 tasks over 9 datasets.
♻ ☆ Self-Explore to Avoid the Pit: Improving the Reasoning Capabilities of Language Models with Fine-grained Rewards
Training on large amounts of rationales (i.e., CoT Fine-tuning) is effective
at improving the reasoning capabilities of large language models (LLMs).
However, acquiring human-authored rationales or augmenting rationales from
proprietary models is costly and not scalable. In this paper, we study the
problem of whether LLMs could self-improve their reasoning capabilities. To
this end, we propose Self-Explore, where the LLM is tasked to explore the first
wrong step (i.e., the first pit) within the rationale and use such signals as
fine-grained rewards for further improvement. On the GSM8K and MATH test sets,
Self-Explore achieves 11.57% and 2.89% improvement on average across three LLMs
compared to supervised fine-tuning (SFT). Our code is available at
https://github.com/hbin0701/Self-Explore.
comment: Preprint Under Review
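A minimal sketch of how a "first pit" can be turned into a fine-grained training signal, assuming a step-level verifier and a preference-style objective (e.g., DPO-like pairs); the authors' exact recipe may differ, and the verifier below is a toy.

    def first_pit(steps: list[str], verifier) -> int:
        """Return the index of the first wrong step (the 'pit'), or -1 if none."""
        for i, step in enumerate(steps):
            if not verifier(steps[:i + 1]):
                return i
        return -1

    def build_preference_pair(question: str, bad_steps: list[str],
                              good_steps: list[str], verifier):
        """Keep the shared prefix before the first pit; the wrong step becomes the
        rejected sample and a verified continuation the chosen one, yielding a
        step-level signal for preference optimization."""
        pit = first_pit(bad_steps, verifier)
        if pit == -1:
            return None  # the rationale never falls into a pit
        prefix = " ".join(bad_steps[:pit])
        return {
            "prompt": f"{question}\n{prefix}".strip(),
            "rejected": bad_steps[pit],            # the first wrong step
            "chosen": " ".join(good_steps[pit:]),  # a verified continuation
        }

    # Toy verifier: a step is "correct" while it avoids the wrong intermediate sum.
    verifier = lambda steps: "2 + 2 = 5" not in steps[-1]
    pair = build_preference_pair(
        "What is (2 + 2) * 3?",
        ["First, 2 + 2 = 5.", "Then 5 * 3 = 15."],
        ["First, 2 + 2 = 4.", "Then 4 * 3 = 12."],
        verifier,
    )
    print(pair)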
♻ ☆ Escaping the sentence-level paradigm in machine translation
It is well-known that document context is vital for resolving a range of
translation ambiguities, and in fact the document setting is the most natural
setting for nearly all translation. It is therefore unfortunate that machine
translation -- both research and production -- largely remains stuck in a
decades-old sentence-level translation paradigm. It is also an increasingly
glaring problem in light of competitive pressure from large language models,
which are natively document-based. Much work in document-context machine
translation exists, but for various reasons has been unable to catch hold. This
paper suggests a path out of this rut by addressing three impediments at once:
what architectures should we use? where do we get document-level information
for training them? and how do we know whether they are any good? In contrast to
work on specialized architectures, we show that the standard Transformer
architecture is sufficient, provided it has enough capacity. Next, we address
the training data issue by taking document samples from back-translated data
only, where the data is not only more readily available, but is also of higher
quality compared to parallel document data, which may contain machine
translation output. Finally, we propose generative variants of existing
contrastive metrics that are better able to discriminate among document
systems. Results in four large-data language pairs (DE$\rightarrow$EN,
EN$\rightarrow$DE, EN$\rightarrow$FR, and EN$\rightarrow$RU) establish the
success of these three pieces together in improving document-level performance.
♻ ☆ Protecting Your LLMs with Information Bottleneck
Zichuan Liu, Zefan Wang, Linjie Xu, Jinyu Wang, Lei Song, Tianchun Wang, Chunlin Chen, Wei Cheng, Jiang Bian
The advent of large language models (LLMs) has revolutionized the field of
natural language processing, yet LLMs can be attacked into producing harmful
content. Despite efforts to align LLMs ethically, these alignments are often
fragile and can be circumvented by jailbreaking attacks through optimized or manual
adversarial prompts. To address this, we introduce the Information Bottleneck
Protector (IBProtector), a defense mechanism grounded in the information
bottleneck principle, and we modify the objective to avoid trivial solutions.
The IBProtector selectively compresses and perturbs prompts, facilitated by a
lightweight and trainable extractor, preserving only essential information for
the target LLMs to respond with the expected answer. Moreover, we further
consider a gradient-free setting, making the approach compatible with
any LLM. Our empirical evaluations show that IBProtector outperforms current
defense methods in mitigating jailbreak attempts, without overly affecting
response quality or inference speed. Its effectiveness and adaptability across
various attack methods and target LLMs underscore the potential of IBProtector
as a novel, transferable defense that bolsters the security of LLMs without
requiring modifications to the underlying models.
comment: 23 pages, 7 figures, 8 tables
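One simplified reading of the IB-style objective described above, assuming token-level keep probabilities from a small trainable extractor: maximize the likelihood of the expected answer (sufficiency) while penalizing how much of the prompt is kept (compression). The real IBProtector loss, extractor, and anti-trivial-solution modification differ in details.

    import torch
    import torch.nn as nn

    class TokenExtractor(nn.Module):
        """Lightweight trainable extractor scoring which prompt tokens to keep."""

        def __init__(self, d_model: int = 64):
            super().__init__()
            self.score = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

        def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
            return torch.sigmoid(self.score(token_embeds)).squeeze(-1)  # keep probs

    def ib_style_loss(keep_probs, answer_logprob, beta: float = 0.1):
        # Sufficiency: the compressed prompt should still elicit the expected answer.
        sufficiency = -answer_logprob
        # Compression: keep as few tokens as possible (a proxy for the IB rate term).
        compression = keep_probs.mean()
        return sufficiency + beta * compression

    extractor = TokenExtractor()
    embeds = torch.randn(1, 10, 64)  # stand-in prompt token embeddings
    keep = extractor(embeds)         # (1, 10) keep probabilities
    # In training, the answer log-prob would come from the target LLM fed the
    # masked prompt (e.g., keep * embeds); a constant stands in here.
    answer_logprob = torch.tensor(-2.3)
    print(ib_style_loss(keep, answer_logprob))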
♻ ☆ GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators ACL 2024
Recent advances in large language models (LLMs) have propelled the development
of multilingual speech and machine translation through reduced representation
errors and the incorporation of external knowledge. However, both
translation tasks typically utilize beam search decoding and top-1 hypothesis
selection for inference. These techniques struggle to fully exploit the rich
information in the diverse N-best hypotheses, making them less optimal for
translation tasks that require a single, high-quality output sequence. In this
paper, we propose a new generative paradigm for translation tasks, namely
"GenTranslate", which builds upon LLMs to generate better results from the
diverse translation versions in the N-best list. Leveraging the rich linguistic
knowledge and strong reasoning abilities of LLMs, our new paradigm can
integrate the rich information in N-best candidates to generate a
higher-quality translation result. Furthermore, to support LLM finetuning, we
build and release a HypoTranslate dataset that contains over 592K
hypotheses-translation pairs in 11 languages. Experiments on various speech and
machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that
our GenTranslate significantly outperforms the state-of-the-art model.
comment: 18 pages, Accepted by ACL 2024. This work is open sourced at:
https://github.com/YUCHEN005/GenTranslate
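The core prompting pattern behind N-best fusion is easy to sketch; `llm` below is a placeholder callable (prompt to string), not the authors' fine-tuned model or any specific API, and the prompt wording is an assumption.

    def fuse_nbest(source: str, nbest: list[str], llm) -> str:
        """Instead of picking the top-1 beam, ask an LLM to synthesize a better
        translation from the whole N-best list (GenTranslate-style idea)."""
        hypotheses = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
        prompt = (
            "Below are candidate translations of the same source sentence, ranked "
            "by a translation model. Combine their complementary strengths and "
            f"produce one improved translation.\n\nSource: {source}\n"
            f"Candidates:\n{hypotheses}\n\nImproved translation:"
        )
        return llm(prompt).strip()

    # Usage with a stub LLM (a real system would call a fine-tuned model here).
    stub_llm = lambda prompt: "Das Wetter ist heute wirklich schoen."
    print(fuse_nbest(
        "The weather is really nice today.",
        ["Das Wetter ist heute schoen.", "Heute ist das Wetter wirklich schoen."],
        stub_llm,
    ))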
♻ ☆ A blind spot for large language models: Supradiegetic linguistic information
Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Jonathan St. Onge, Mikaela Fudolig, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds
Large Language Models (LLMs) like ChatGPT reflect profound changes in the
field of Artificial Intelligence, achieving a linguistic fluency that is
impressively, even shockingly, human-like. The extent of their current and
potential capabilities is an active area of investigation by no means limited
to scientific researchers. It is common for people to frame the training data
for LLMs as "text" or even "language". We examine the details of this framing
using ideas from several areas, including linguistics, embodied cognition,
cognitive science, mathematics, and history. We propose that considering what
it is like to be an LLM like ChatGPT, as Nagel might have put it, can help us
gain insight into its capabilities in general, and in particular, that its
exposure to linguistic training data can be productively reframed as exposure
to the diegetic information encoded in language, and its deficits can be
reframed as ignorance of extradiegetic information, including supradiegetic
linguistic information. Supradiegetic linguistic information consists of those
arbitrary aspects of the physical form of language that are not derivable from
the one-dimensional relations of context -- frequency, adjacency, proximity,
co-occurrence -- that LLMs like ChatGPT have access to. Roughly speaking, the
diegetic portion of a word can be thought of as its function, its meaning, as
the information in a theoretical vector in a word embedding, while the
supradiegetic portion of the word can be thought of as its form, like the
shapes of its letters or the sounds of its syllables. We use these concepts to
investigate why LLMs like ChatGPT have trouble handling palindromes, the visual
characteristics of symbols, translating Sumerian cuneiform, and continuing
integer sequences.
comment: 21 pages, 6 figures, 3 tables. Accepted at IC2S2 2024. arXiv admin
note: text overlap with arXiv:2206.02608, arXiv:2303.12712, arXiv:2305.10601,
arXiv:2305.06424, arXiv:1908.08530 by other authors
♻ ☆ PACE: Improving Prompt with Actor-Critic Editing for Large Language Model ACL
Large language models (LLMs) have showcased remarkable potential across
various tasks by conditioning on prompts. However, the quality of different
human-written prompts leads to substantial discrepancies in LLMs' performance,
and improving prompts usually necessitates considerable human effort and
expertise. To this end, this paper proposes Prompt with Actor-Critic Editing
(PACE) for LLMs to enable automatic prompt editing. Drawing inspiration from
the actor-critic algorithm in reinforcement learning, PACE leverages LLMs as
the dual roles of actor and critic, conceptualizing the prompt as a type of
policy. PACE refines the prompt based on feedback from both the actors
executing the prompt and the critics critiquing the responses. This process
helps LLMs better align the prompt to a specific task, thanks to real
responses and reasoning from LLMs. We conduct extensive experiments on 24
instruction induction tasks and 21 big-bench tasks. Experimental results
indicate that PACE elevates the relative performance of medium/low-quality
human-written prompts by up to 98\%, reaching performance comparable to that
of high-quality human-written prompts. Moreover, PACE also exhibits notable
efficacy for prompt generation.
comment: Accepted to ACL
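A hedged sketch of the actor-critic editing loop described above; `llm` is a placeholder callable (prompt to string), and the templates are assumptions rather than the paper's exact prompts.

    def pace_style_edit(prompt: str, examples: list[dict], llm, rounds: int = 3) -> str:
        """One LLM acts (executes the prompt), one criticizes the responses, and an
        editor rewrites the prompt (the policy) from the collected critiques."""
        for _ in range(rounds):
            critiques = []
            for ex in examples:
                # Actor: run the current prompt on a task input.
                response = llm(f"{prompt}\nInput: {ex['input']}\nOutput:")
                # Critic: judge the response against the expected output.
                critiques.append(llm(
                    "Critique this response given the instruction and expected output.\n"
                    f"Instruction: {prompt}\nInput: {ex['input']}\n"
                    f"Expected: {ex['output']}\nResponse: {response}\nCritique:"
                ))
            # Editor: revise the prompt using the feedback.
            prompt = llm(
                "Rewrite the instruction so the critiqued failures are fixed.\n"
                f"Instruction: {prompt}\nCritiques:\n" + "\n".join(critiques) +
                "\nImproved instruction:"
            ).strip()
        return prompt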
♻ ☆ Retrieval augmented text-to-SQL generation for epidemiological question answering using electronic health records
Electronic health records (EHR) and claims data are rich sources of
real-world data that reflect patient health status and healthcare utilization.
Querying these databases to answer epidemiological questions is challenging due
to the intricacy of medical terminology and the need for complex SQL queries.
Here, we introduce an end-to-end methodology that combines text-to-SQL
generation with retrieval augmented generation (RAG) to answer epidemiological
questions using EHR and claims data. We show that our approach, which
integrates a medical coding step into the text-to-SQL process, significantly
improves the performance over simple prompting. Our findings indicate that
although current language models are not yet sufficiently accurate for
unsupervised use, RAG offers a promising direction for improving their
capabilities, as shown in a realistic industry setting.
comment: 6 pages, 1 figure
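A sketch of the described pipeline shape: resolve medical terms to codes first, then ground SQL generation in those codes. The schema, helpers, prompt, and stubs below are illustrative assumptions, not the paper's actual setup (E11 is the real ICD-10 code family for type 2 diabetes).

    def answer_epi_question(question: str, vocabulary: list[str], code_lookup, llm) -> str:
        # Step 1 (medical coding): map free-text terms in the question onto
        # vocabulary codes, e.g. via a retriever over a code database.
        codes = {t: code_lookup(t) for t in vocabulary if t in question.lower()}
        # Step 2 (retrieval-augmented text-to-SQL): give the model the schema plus
        # resolved codes so the SQL filters on codes rather than raw strings.
        prompt = (
            "Schema: conditions(person_id, icd10_code, start_date)\n"
            f"Resolved codes: {codes}\n"
            f"Question: {question}\n"
            "Write a SQL query using the resolved codes.\nSQL:"
        )
        return llm(prompt)

    code_lookup = lambda term: "E11"  # toy lookup stub
    llm = lambda p: ("SELECT COUNT(DISTINCT person_id) FROM conditions "
                     "WHERE icd10_code LIKE 'E11%';")
    print(answer_epi_question("How many patients have type 2 diabetes?",
                              ["type 2 diabetes", "hypertension"], code_lookup, llm))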
♻ ☆ Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models ACL 2024
In the rapidly advancing field of artificial intelligence, the concept of
Red-Teaming or Jailbreaking large language models (LLMs) has emerged as a
crucial area of study. This approach is especially significant in terms of
assessing and enhancing the safety and robustness of these models. This paper
investigates the intricate consequences of such modifications through model
editing, uncovering a complex relationship between enhancing model accuracy and
preserving its ethical integrity. Our in-depth analysis reveals a striking
paradox: while injecting accurate information is crucial for model reliability,
it can paradoxically destabilize the model's foundational framework, resulting
in unpredictable and potentially unsafe behaviors. Additionally, we propose a
benchmark dataset NicheHazardQA to investigate this unsafe behavior both within
the same and cross topical domain. This aspect of our research sheds light on
how edits impact the model's safety metrics and guardrails. Our findings
show that model editing serves as a cost-effective tool for topical red-teaming
by methodically applying targeted edits and evaluating the resultant model
behavior.
comment: Accepted at ACL 2024
♻ ☆ Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models ACL
Recent statements about the impressive capabilities of large language models
(LLMs) are usually supported by evaluating on open-access benchmarks.
Considering the vast size and wide-ranging sources of LLMs' training data,
that data may explicitly or implicitly include test data, leaving LLMs
susceptible to data contamination. However, due to the opacity of training
data, the black-box access of models, and the rapid growth of synthetic
training data, detecting and mitigating data contamination for LLMs faces
significant challenges. In this paper, we propose CDD, which stands for
Contamination Detection via output Distribution for LLMs. CDD necessitates only
the sampled texts to detect data contamination, by identifying the peakedness
of the LLM's output distribution. To mitigate the impact of data contamination in
evaluation, we also present TED: Trustworthy Evaluation via output
Distribution, based on the correction of LLM's output distribution. To
facilitate this study, we introduce two benchmarks, i.e., DetCon and ComiEval,
for data contamination detection and contamination mitigation evaluation tasks.
Extensive experimental results show that CDD achieves the average relative
improvements of 21.8\%-30.2\% over other contamination detection approaches in
terms of Accuracy, F1 Score, and AUC metrics, and can effectively detect
contamination caused by variants of test data. TED significantly mitigates
inflated performance gains of up to 66.9\% attributed to data contamination
across 24 settings and 21 contamination degrees. In real-world applications,
we reveal that ChatGPT is highly likely to suffer from data contamination on
the HumanEval benchmark.
comment: Accepted to ACL
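The abstract does not spell out the peakedness statistic, so the following is only one plausible instantiation of the idea, with invented names and threshold: if a model has memorized a test item, repeated stochastic samples collapse onto (near-)identical outputs, so the empirical distribution over samples is highly peaked.

    from collections import Counter

    def peakedness_score(samples: list[str]) -> float:
        """Mass of the modal output across repeated samples of the same prompt."""
        counts = Counter(samples)
        return counts.most_common(1)[0][1] / len(samples)

    def flag_contamination(samples: list[str], threshold: float = 0.8) -> bool:
        # High peakedness under non-zero sampling temperature is suspicious.
        return peakedness_score(samples) >= threshold

    # Toy usage: 10 samples of the same prompt at temperature > 0.
    memorized = ["def add(a, b): return a + b"] * 9 + ["def add(x, y): return x + y"]
    fresh = [f"solution variant {i}" for i in range(10)]
    print(flag_contamination(memorized), flag_contamination(fresh))  # True False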
♻ ☆ FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference
Retrieval-Augmented Language Modeling (RALM) by integrating large language
models (LLM) with relevant documents from an external corpus is a proven method
for enabling the LLM to generate information beyond the scope of its
pre-training corpus. Previous work that simply prepends retrieved content to
the input incurs high runtime costs, degrading the inference efficiency of
LLMs because the Key-Value (KV) cache cannot be used efficiently. In this
paper, we propose FlashBack, a modular RALM designed to improve inference
efficiency through an appending-context pattern while maintaining decent
performance after fine-tuning with Low-Rank Adaptation. FlashBack appends
retrieved documents at the end of the context, rather than prepending them,
so the KV cache is utilized efficiently. We also introduce two special Marking
Tokens that mark the boundary of the appended context during fine-tuning. Our
experiments show that FlashBack maintains decent generation quality in terms
of perplexity, while its inference speed is up to $4\times$ faster than the
prepending counterpart on a 7B LLM (Llama 2) in runtime tests. By bypassing
unnecessary re-computation, FlashBack achieves significantly faster inference,
and this heightened efficiency substantially reduces inference cost.
comment: 14 pages
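Why appending preserves the KV cache can be seen with generic Hugging Face cache reuse; this is an illustration with a small stand-in model (gpt2), not the authors' implementation, and it omits FlashBack's Marking Tokens and LoRA fine-tuning.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    context = tok("The user asked about KV caches.", return_tensors="pt")
    with torch.no_grad():
        # Encode the running context once; its KV cache stays valid...
        out = model(**context, use_cache=True)
        cache = out.past_key_values

        # ...because the retrieved document is APPENDED after it. Prepending
        # would shift every context token's position and force recomputation.
        retrieved = tok(" Retrieved doc: caches store per-token keys/values.",
                        return_tensors="pt")
        total_len = context["input_ids"].shape[1] + retrieved["input_ids"].shape[1]
        out = model(input_ids=retrieved["input_ids"],
                    attention_mask=torch.ones(1, total_len, dtype=torch.long),
                    past_key_values=cache, use_cache=True)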
♻ ☆ E2TP: Element to Tuple Prompting Improves Aspect Sentiment Tuple Prediction
Generative approaches have significantly influenced Aspect-Based Sentiment
Analysis (ABSA), garnering considerable attention. However, existing studies
often predict target text components monolithically, neglecting the benefits of
utilizing single elements for tuple prediction. In this paper, we introduce
Element to Tuple Prompting (E2TP), employing a two-step architecture. The
former step focuses on predicting single elements, while the latter step
completes the process by mapping these predicted elements to their
corresponding tuples. E2TP is inspired by human problem-solving, breaking down
tasks into manageable parts, using the first step's output as a guide in the
second step. Within this strategy, three types of paradigms, namely
E2TP($diet$), E2TP($f_1$), and E2TP($f_2$), are designed to facilitate the
training process. Beyond dataset-specific experiments, our paper addresses
cross-domain scenarios, demonstrating the effectiveness and generalizability of
the approach. By conducting a comprehensive analysis on various benchmarks, we
show that E2TP achieves new state-of-the-art results in nearly all cases.
♻ ☆ BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language
The BEIR dataset is a large, heterogeneous benchmark for Information
Retrieval (IR) in zero-shot settings, garnering considerable attention within
the research community. However, BEIR and analogous datasets are predominantly
restricted to the English language. Our objective is to establish extensive
large-scale resources for IR in the Polish language, thereby advancing the
research in this NLP area. In this work, inspired by the mMARCO and Mr. TyDi
datasets, we translated all accessible open IR datasets into Polish, and we
introduced the BEIR-PL benchmark -- a new benchmark which comprises 13
datasets, facilitating further development, training and evaluation of modern
Polish language models for IR tasks. We executed an evaluation and comparison
of numerous IR models on the newly introduced BEIR-PL benchmark. Furthermore,
we publish pre-trained open IR models for the Polish language, marking a
pioneering development in this field. Additionally, the evaluation revealed
that BM25 achieved significantly lower scores for Polish than for English,
which can be attributed to the high inflection and intricate morphological
structure of the Polish language. Finally, we trained various re-ranking models
to enhance the BM25 retrieval, and we compared their performance to identify
their unique characteristic features. To ensure accurate model comparisons, it
is necessary to scrutinise individual results rather than to average across the
entire benchmark. Thus, we thoroughly analysed the outcomes of IR models in
relation to each individual data subset encompassed by the BEIR benchmark. The
benchmark data is available at https://huggingface.co/clarin-knext.
♻ ☆ Balancing Speciality and Versatility: a Coarse to Fine Framework for Supervised Fine-tuning Large Language Model ACL 2024
Aligned Large Language Models (LLMs) showcase remarkable versatility, capable
of handling diverse real-world tasks. Meanwhile, aligned LLMs are also expected
to exhibit speciality, excelling in specific applications. However, fine-tuning
with extra data, a common practice to gain speciality, often leads to
catastrophic forgetting (CF) of previously acquired versatility, hindering the
model's performance across diverse tasks. In response to this challenge, we
propose CoFiTune, a coarse-to-fine framework that attempts to strike the
balance between speciality and versatility. At the coarse-grained level, an
empirical tree-search algorithm is utilized to pinpoint and update specific
modules that are crucial for speciality, while keeping other parameters frozen;
at the fine-grained level, a soft-masking mechanism regulates the update to the
LLMs, mitigating the CF issue without harming speciality. In an overall
evaluation of both speciality and versatility, CoFiTune consistently
outperforms baseline methods across diverse tasks and model scales. Compared to
the full-parameter SFT, CoFiTune leads to about 14% versatility improvement and
marginal speciality loss on a 13B model. Lastly, based on further analysis, we
provide a speculative insight into the information forwarding process in LLMs,
which helps explain the effectiveness of the proposed method. The code is
available at https://github.com/rattlesnakey/CoFiTune.
comment: 43 pages, 10 figures, accepted by ACL 2024 Findings
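A minimal sketch of the soft-masking half of the idea, assuming per-parameter importance scores are already available: gradients are scaled by a mask in [0, 1] so updates to weights deemed important for previously acquired versatile behavior are damped. How CoFiTune derives importance, and at what granularity it masks, are not reproduced here.

    import torch
    import torch.nn as nn

    def apply_soft_masks(model: nn.Module, importance: dict[str, torch.Tensor]):
        """Scale each parameter's gradient by (1 - importance), damping updates
        to weights marked as important for versatility."""
        for name, param in model.named_parameters():
            if name in importance:
                mask = 1.0 - importance[name].clamp(0, 1)  # important -> small update
                param.register_hook(lambda g, m=mask: g * m)

    model = nn.Linear(4, 2)
    importance = {"weight": torch.full((2, 4), 0.9)}  # stand-in importance scores
    apply_soft_masks(model, importance)
    loss = model(torch.randn(3, 4)).sum()
    loss.backward()  # model.weight.grad is now scaled by 0.1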
♻ ☆ Should agentic conversational AI change how we think about ethics? Characterising an interactional ethics centred on respect
With the growing popularity of conversational agents based on large language
models (LLMs), we need to ensure their behaviour is ethical and appropriate.
Work in this area largely centres around the 'HHH' criteria: making outputs
more helpful and honest, and avoiding harmful (biased, toxic, or inaccurate)
statements. Whilst this semantic focus is useful when viewing LLM agents as
mere mediums or output-generating systems, it fails to account for pragmatic
factors that can make the same speech act seem more or less tactless or
inconsiderate in different social situations. With the push towards agentic AI,
wherein systems become increasingly proactive in chasing goals and performing
actions in the world, considering the pragmatics of interaction becomes
essential. We propose an interactional approach to ethics that is centred on
relational and situational factors. We explore what it means for a system, as a
social actor, to treat an individual respectfully in a (series of)
interaction(s). Our work anticipates a set of largely unexplored risks at the
level of situated social interaction, and offers practical suggestions to help
agentic LLM technologies treat people well.
♻ ☆ AnglE-optimized Text Embeddings ACL24
High-quality text embedding is pivotal in improving semantic textual
similarity (STS) tasks, which are crucial components in Large Language Model
(LLM) applications. However, a common challenge existing text embedding models
face is the problem of vanishing gradients, primarily due to their reliance on
the cosine function in the optimization objective, which has saturation zones.
To address this issue, this paper proposes a novel angle-optimized text
embedding model called AnglE. The core idea of AnglE is to introduce angle
optimization in a complex space. This novel approach effectively mitigates the
adverse effects of the saturation zone in the cosine function, which can
impede gradients and hinder optimization. To set up a comprehensive STS
evaluation, we experimented on existing short-text STS datasets and a newly
collected long-text STS dataset from GitHub Issues. Furthermore, we examine
domain-specific STS scenarios with limited labeled data and explore how AnglE
works with LLM-annotated data. Extensive experiments were conducted on various
tasks including short-text STS, long-text STS, and domain-specific STS tasks.
The results show that AnglE outperforms the state-of-the-art (SOTA) STS models
that ignore the cosine saturation zone. These findings demonstrate the ability
of AnglE to generate high-quality text embeddings and the usefulness of angle
optimization in STS.
comment: Accepted by ACL24 Main Conference
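A simplified sketch of the angle-optimization idea: read each embedding as complex numbers (first half real, second half imaginary), measure the angle of the element-wise complex division between paired embeddings, and train with a ranking loss so more-similar pairs get smaller angles. Unlike cosine similarity, the angle does not saturate near the extremes. The released AnglE objective combines this with additional terms, so this is illustrative only.

    import torch

    def complex_angle(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        """Mean absolute angle of the element-wise complex division u / v."""
        re_u, im_u = u.chunk(2, dim=-1)
        re_v, im_v = v.chunk(2, dim=-1)
        denom = re_v.pow(2) + im_v.pow(2) + 1e-8
        re = (re_u * re_v + im_u * im_v) / denom   # real part of u / v
        im = (im_u * re_v - re_u * im_v) / denom   # imaginary part of u / v
        return torch.atan2(im, re).abs().mean(dim=-1)

    def angle_ranking_loss(angles: torch.Tensor, labels: torch.Tensor,
                           tau: float = 0.05) -> torch.Tensor:
        # Whenever pair i is more similar than pair j (labels[i] > labels[j]),
        # its angle should be smaller; penalize every violating ordering.
        diff = (angles.unsqueeze(1) - angles.unsqueeze(0)) / tau  # angle_i - angle_j
        mask = labels.unsqueeze(1) > labels.unsqueeze(0)
        return torch.log1p(torch.exp(diff[mask]).sum())

    u, v = torch.randn(8, 64), torch.randn(8, 64)  # 8 sentence pairs (even dim)
    labels = torch.rand(8)                         # gold similarity scores
    loss = angle_ranking_loss(complex_angle(u, v), labels)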
♻ ☆ ALBA: Adaptive Language-based Assessments for Mental Health
Mental health issues differ widely among individuals, with varied signs and
symptoms. Recently, language-based assessments have shown promise in capturing
this diversity, but they require a substantial sample of words per person for
accuracy. This work introduces the task of Adaptive Language-Based Assessment
(ALBA), which involves adaptively ordering questions while also scoring an
individual's latent psychological trait using limited language responses to
previous questions. To this end, we develop adaptive testing methods under two
psychometric measurement theories: Classical Test Theory and Item Response
Theory. We empirically evaluate ordering and scoring strategies, organized
into two new methods: a semi-supervised item-response-theory-based method
(ALIRT) and a supervised Actor-Critic model. While both methods improve over
non-adaptive baselines, we found ALIRT to be the most accurate and scalable,
achieving the highest accuracy with fewer questions (e.g., Pearson r ~ 0.93
after only 3 questions, compared to typically needing at least 7
questions). In general, adaptive language-based assessments of depression and
anxiety were able to utilize a smaller sample of language without compromising
validity or incurring large computational costs.
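ALIRT builds on standard item-response-theory machinery. As a sketch of the adaptive-ordering half, the next question can be chosen by maximizing Fisher information at the current trait estimate under a 2PL model; the language-based scoring of the latent trait is not reproduced here, and the item parameters below are invented.

    import math

    def p_endorse(theta: float, a: float, b: float) -> float:
        """2PL item response function: P(endorse | trait theta,
        discrimination a, difficulty b)."""
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    def fisher_information(theta: float, a: float, b: float) -> float:
        """Item information at theta: I(theta) = a^2 * P * (1 - P)."""
        p = p_endorse(theta, a, b)
        return a * a * p * (1.0 - p)

    def next_question(theta_hat: float, remaining: dict) -> str:
        """Adaptive ordering: ask the item most informative at the current estimate."""
        return max(remaining, key=lambda q: fisher_information(theta_hat, *remaining[q]))

    items = {  # invented (discrimination, difficulty) parameters per question
        "feeling down":     (1.8, 0.0),
        "sleep problems":   (1.2, -0.5),
        "loss of interest": (2.0, 1.0),
    }
    theta_hat = 0.2  # current latent-trait estimate from earlier responses
    print(next_question(theta_hat, items))  # -> "feeling down" (peak info near 0.2)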
♻ ☆ Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models
In this paper, we deeply explore several mechanisms employed by
Transformer-based language models in factual recall tasks. In zero-shot
scenarios, given a prompt like ``The capital of France is,'' task-specific
attention heads extract the topic entity, such as ``France,'' from the context
and pass it to subsequent MLPs to recall the required answer such as ``Paris.''
We introduce a novel analysis method aimed at decomposing the outputs of the
MLP into components understandable by humans. Through this method, we quantify
the function of the MLP layer following these task-specific heads. In the
residual stream, it either erases or amplifies the information originating from
individual heads. Moreover, it generates a component that redirects the
residual stream towards the direction of its expected answer. These zero-shot
mechanisms are also employed in few-shot scenarios. Additionally, we observe a
widespread anti-overconfidence mechanism in the final layer of models,
which suppresses correct predictions. We mitigate this suppression by
leveraging our interpretation to improve factual recall confidence. Our
interpretations have been evaluated across various language models, including
the GPT-2 families, 1.3B OPT, and 7B Llama-2, encompassing diverse tasks
spanning various domains of factual knowledge.
♻ ☆ Capturing Perspectives of Crowdsourced Annotators in Subjective Learning Tasks
Negar Mokhberian, Myrl G. Marmarelis, Frederic R. Hopp, Valerio Basile, Fred Morstatter, Kristina Lerman
Supervised classification heavily depends on datasets annotated by humans.
However, in subjective tasks such as toxicity classification, these annotations
often exhibit low agreement among raters. Annotations have commonly been
aggregated by employing methods like majority voting to determine a single
ground truth label. In subjective tasks, aggregating labels will result in
biased labeling and, consequently, biased models that can overlook minority
opinions. Previous studies have shed light on the pitfalls of label aggregation
and have introduced a handful of practical approaches to tackle this issue.
Recently proposed multi-annotator models, which predict labels individually per
annotator, are vulnerable to under-determination for annotators with few
samples. This problem is exacerbated in crowdsourced datasets. In this work, we
propose \textbf{Annotator Aware Representations for Texts (AART)} for
subjective classification tasks. Our approach involves learning representations
of annotators, allowing for exploration of annotation behaviors. We show that
our method improves on metrics assessing how well individual annotators'
perspectives are captured. Additionally, we use fairness metrics to evaluate
the equitability of our model's performance for marginalized annotators
compared to others.
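A hedged sketch of the annotator-aware idea: combine a text representation with a learned per-annotator embedding so the model predicts each rater's label instead of a single aggregated one. AART's exact architecture and losses (e.g., any regularization over annotator embeddings) are not reproduced; sizes are assumptions.

    import torch
    import torch.nn as nn

    class AnnotatorAwareClassifier(nn.Module):
        def __init__(self, text_dim: int = 768, n_annotators: int = 50,
                     n_classes: int = 2):
            super().__init__()
            self.annotator_emb = nn.Embedding(n_annotators, text_dim)
            self.classifier = nn.Linear(text_dim, n_classes)

        def forward(self, text_repr: torch.Tensor, annotator_ids: torch.Tensor):
            # Additive combination: the annotator embedding shifts the text
            # representation toward that rater's decision boundary.
            return self.classifier(text_repr + self.annotator_emb(annotator_ids))

    model = AnnotatorAwareClassifier()
    text_repr = torch.randn(4, 768)            # e.g., [CLS] vectors from an encoder
    annotators = torch.tensor([3, 3, 17, 42])  # who labeled each example
    labels = torch.tensor([1, 0, 1, 1])        # their individual labels
    loss = nn.functional.cross_entropy(model(text_repr, annotators), labels)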
♻ ☆ Enhancing Small Medical Learners with Privacy-preserving Contextual Prompting
Large language models (LLMs) demonstrate remarkable medical expertise, but
data privacy concerns impede their direct use in healthcare environments.
Although offering improved data privacy protection, domain-specific small
language models (SLMs) often underperform LLMs, emphasizing the need for
methods that reduce this performance gap while alleviating privacy concerns. In
this paper, we present a simple yet effective method that harnesses LLMs'
medical proficiency to boost SLM performance in medical tasks under
privacy-restricted scenarios. Specifically, we mitigate patient privacy issues
by extracting keywords from medical data and prompting the LLM to generate a
medical knowledge-intensive context by simulating clinicians' thought
processes. This context serves as additional input for SLMs, augmenting their
decision-making capabilities. Our method significantly enhances performance in
both few-shot and full training settings across three medical
knowledge-intensive tasks, achieving up to a 22.57% increase in absolute
accuracy compared to SLM fine-tuning without context, and sets new
state-of-the-art results in two medical tasks within privacy-restricted
scenarios. Further out-of-domain testing and experiments in two general domain
datasets showcase its generalizability and broad applicability. Our code can be
found at https://github.com/XZhang97666/PrivacyBoost-SLM.
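The pipeline shape is simple to sketch: only de-identified keywords leave the private environment, the LLM returns generic medical knowledge, and that knowledge augments the local SLM's input. Helper names, the prompt, and the stubs below are assumptions, not the authors' code.

    def privacy_preserving_context(note: str, extract_keywords, llm) -> str:
        # Step 1: extract medical keywords locally (no raw patient text is shared).
        keywords = extract_keywords(note)
        # Step 2: ask the LLM for clinician-style background knowledge about the
        # keywords alone.
        context = llm(
            "Acting as a clinician, write brief background knowledge relevant to "
            f"these medical concepts: {', '.join(keywords)}"
        )
        # Step 3: the privacy-safe context augments the SLM's input.
        return f"Context: {context}\nNote: {note}"

    stub_extract = lambda text: ["chest pain", "troponin"]
    stub_llm = lambda p: ("Elevated troponin with chest pain raises concern for "
                          "acute coronary syndrome.")
    slm_input = privacy_preserving_context(
        "58yo with chest pain; troponin pending.", stub_extract, stub_llm)
    print(slm_input)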
♻ ☆ Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models LREC 2024
Accents play a pivotal role in shaping human communication, enhancing our
ability to convey and comprehend messages with clarity and cultural nuance.
While there has been significant progress in Automatic Speech Recognition
(ASR), African-accented English ASR has been understudied due to a lack of
training datasets, which are often expensive to create and demand colossal
human labor. Combining several active learning paradigms and the core-set
approach, we propose a new multi-round adaptation process that uses epistemic
uncertainty to automate the annotation process, significantly reducing the
associated costs and human labor. This novel method streamlines data annotation
and strategically selects data samples that contribute most to model
uncertainty, thereby enhancing training efficiency. We define a new metric
called U-WER to track model adaptation to hard accents. We evaluate our
approach across several domains, datasets, and high-performing speech models.
Our results show that our approach leads to a 69.44\% WER improvement while
requiring on average 45\% less data than established baselines. Our approach
also improves out-of-distribution generalization for very low-resource accents,
demonstrating its viability for building generalizable ASR models in the
context of accented African ASR. We open-source the code at
https://github.com/bonaventuredossou/active_learning_african_asr.
comment: Preprint Under review. Previously accepted at SIGUL-LREC 2024
Workshop
♻ ☆ From Matching to Generation: A Survey on Generative Information Retrieval
Information Retrieval (IR) systems are crucial tools for users to access
information, widely applied in scenarios like search engines, question
answering, and recommendation systems. Traditional IR methods, based on
similarity matching to return ranked lists of documents, have been reliable
means of information acquisition, dominating the IR field for years. With the
advancement of pre-trained language models, generative information retrieval
(GenIR) has emerged as a novel paradigm, gaining increasing attention in recent
years. Currently, research in GenIR can be categorized into two aspects:
generative document retrieval (GR) and reliable response generation. GR
leverages the generative model's parameters for memorizing documents, enabling
retrieval by directly generating relevant document identifiers without explicit
indexing. Reliable response generation, on the other hand, employs language
models to directly generate the information users seek, breaking the
limitations of traditional IR in terms of document granularity and relevance
matching, offering more flexibility, efficiency, and creativity, thus better
meeting practical needs. This paper aims to systematically review the latest
research progress in GenIR. We will summarize the advancements in GR regarding
model training, document identifier, incremental learning, downstream tasks
adaptation, multi-modal GR and generative recommendation, as well as progress
in reliable response generation in aspects of internal knowledge memorization,
external knowledge augmentation, generating response with citations and
personal information assistant. We also review the evaluation, challenges and
future prospects in GenIR systems. This review aims to offer a comprehensive
reference for researchers in the GenIR field, encouraging further development
in this area.
♻ ☆ Large Language Model-Enhanced Algorithm Selection: Towards Comprehensive Algorithm Representation IJCAI 2024
Algorithm selection, a critical process of automated machine learning, aims
to identify the most suitable algorithm for solving a specific problem prior to
execution. Mainstream algorithm selection techniques heavily rely on problem
features, while the role of algorithm features remains largely unexplored. Due
to the intrinsic complexity of algorithms, effective methods for universally
extracting algorithm information are lacking. This paper takes a significant
step towards bridging this gap by introducing Large Language Models (LLMs) into
algorithm selection for the first time. By comprehending the code text, LLM not
only captures the structural and semantic aspects of the algorithm, but also
demonstrates contextual awareness and library function understanding. The
high-dimensional algorithm representation extracted by LLM, after undergoing a
feature selection module, is combined with the problem representation and
passed to the similarity calculation module. The selected algorithm is
determined by the matching degree between a given problem and different
algorithms. Extensive experiments validate the performance superiority of the
proposed model and the efficacy of each key module. Furthermore, we present a
theoretical upper bound on model complexity, showcasing the influence of
algorithm representation and feature selection modules. This provides valuable
theoretical guidance for the practical implementation of our method.
comment: Accepted by IJCAI 2024
♻ ☆ FEEL: A Framework for Evaluating Emotional Support Capability with Large Language Models
Emotional Support Conversation (ESC) is a typical dialogue that can
effectively assist the user in mitigating emotional pressure. However, owing
to the inherent subjectivity involved in analyzing emotions, current automatic
methodologies face challenges in effectively appraising emotional support
capability, and their metrics exhibit low correlation with human judgments.
Concurrently, manual evaluation incurs extremely high costs. To solve these
problems, we propose a novel model, FEEL (Framework for Evaluating Emotional
Support Capability with Large Language Models),
employing Large Language Models (LLMs) as evaluators to assess emotional
support capabilities. The model meticulously considers various evaluative
aspects of ESC to apply a more comprehensive and accurate evaluation method for
ESC. Additionally, it employs a probability distribution approach for a more
stable result and integrates an ensemble learning strategy, leveraging multiple
LLMs with assigned weights to enhance evaluation accuracy. To appraise the
performance of FEEL, we conduct extensive experiments on existing ESC model
dialogues. Experimental results demonstrate that our model aligns
substantially better with human evaluations than the baselines. Our
source code is available at https://github.com/Ansisy/FEEL.
comment: 14 pages, 3 figures and 4 tables
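The two mechanisms named in the abstract are easy to sketch: each LLM judge yields a probability distribution over the rating scale (more stable than a single sampled score), and judges are combined by assigned weights. FEEL's actual prompts, scale, and weighting scheme are not reproduced; the numbers below are invented.

    def ensemble_rating(distributions: list[list[float]], weights: list[float]) -> float:
        """Weighted mixture of judge distributions, reduced to an expected rating."""
        scale = range(1, len(distributions[0]) + 1)  # e.g., ratings 1..5
        total = sum(weights)
        mixed = [
            sum(w * dist[i] for w, dist in zip(weights, distributions)) / total
            for i in range(len(distributions[0]))
        ]
        return sum(r * p for r, p in zip(scale, mixed))  # expected rating

    # Two hypothetical judges over a 1-5 scale, the first trusted slightly more.
    judge_a = [0.05, 0.10, 0.20, 0.40, 0.25]
    judge_b = [0.10, 0.15, 0.30, 0.30, 0.15]
    print(round(ensemble_rating([judge_a, judge_b], weights=[0.6, 0.4]), 3))  # 3.52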