Akari Asai^†, Zeqiu Wu^†, Yizhong Wang^†§, Avirup Sil^‡, Hannaneh Hajishirzi^†§
^† University of Washington ^§ Allen Institute for AI ^‡ IBM Research AI

Abstract

(Hallucination and RAG) LLM 의 강력한 성능에도 Hallucination (Factual inconsinstency) 문제는 여전히 발생하고, Retrieval-Augmented Generation (RAG) 을 기반으로한 LM 모델이 이러한 issue 를 잘 다룬다.
(Unrelevant retrieval probelm) 그러나, retrieval 자체가 necessary 하지 않거나, passage 가 relevant 하지 않은 경우에는 이러한 Retireval Augment 방법 자체가 response generation 에 unhelpful 할 수 있다.
( Self-RAG ) 이에 본 연구에서는 Self-Reflective Retrieval-Augmented Generation (Self-RAG) 방법을 제안한다. 이는 하나의 LM 이 (1) passage 를 on-demand 로 retrieve 해오고, 이를 통해 (2) generate 한 이후, (3) refelection token 을 통해 retrieved passage 와 own generation 을 reflect 한다.
(Controllability) inference 과정에서 다양한 task 요구에 맞춰 reflection token 을 조절할 수 있다.
(Experiment) 실험 결과 7B, 13B 모델이 state-of-the-art RALM 을 능가하는 성능을 보였고, Open-domain QA 에서 ChatGPT 와 retrieval-augmented LLaMa2-chat 의 성능을 뛰어넘었다. 그리고 Long-form generation 에서의 factuality accuracy 도 매우 높다.

1. Introduction

▶ Retrieval-Augmented Generation (RAG)
LLM 의 강력한 성능에도 hallucination 문제가 많자, 위의 Figure-left 처럼 retrieval 을 붙인 RAG (or RALM) 이 연구가 많이 되고 있다. 그러나 이러한 것들은 unnecessary or off-topic passage 를 introduce 하여 low-quality generation 을 발생시킨다. 이러한 이유의 가장 큰 이유는 wheter the factual grounding is helpful 에 regardless 하게 가져오기 때문이다. 또한, generation LM 자체가 Retrieval 된 것을 활용하도록 학습되지는 않았기 때문에 generation 과정에서 retrieved relevant passage 에 consistent 하게 생성하는지도 알기 힘들다.

▶ Self-Reflective Retrieval-augmented Generation (Self-RAG)
이에 저자들은 Self-Reflective Retrieval-augmented Generation (Self-RAG) 를 제안한다. 이는 on-deman retrieval 과 self-refelection 방법을 통해 위의 vesatility 를 극복한다. 저자들은 special token 인 reflection token 을 활용하여, End2End 방법으로 generation 과 reflection 두 가지를 학습한다. Reflection 토큰은 retireval 토큰과 critique 토큰으로 나뉘고, 이들은 각각 need for retrieval 과 generation quality 를 판단한다.

좀 더 자세히 살펴보면, input prompt 와 preceding generation 에 대하여,

(1) SELF-RAG 는 우선 retrieved passage 로 continued generation 을 augment 하는게 좋은지 판단하고,
(2) 그렇다면, retrieval token을 출력하여 on-demand 로 retriever 를 call 한다.
(3) 이후 multiple retrieved passage 에 대하여, relevance 를 evaluate 하고, corresponding task output 을 generate 한다.
(4) 이후 critique token 이, output 을 factuality and overall quality 의 관점에서 criticize 한 이후 best one 을 고른다.

SELF-RAG 는 model vocab 을 확장하여, next token prediction 에서 reflection token 을 생성하도록 학습된다. Reflection token 은 RL의 reward model 에 영감을 받아, critic model 을 통해 학습된 original corpus 에 직접적으로 offline 으로 추가된다. Critic model 은 input, output, 그리고 GPT-4 등의 propriety LM 에 prompt 되어 collect 된 reflection token 으로 이뤄진 dataset 에 의해 학습된다. 또한, text generation 에서 쓰이는 control token 에 영감을 받아, prediction 을 assess 하고 최종 generation output 에 쓰일 critique token 을 추가활용한다.

이 Reflection token 을 통해, SELF-RAG 는 customizable decoding 을 할 수 있다. 예를 들어, retrieval frequency 를 유연하게 조정하거나, user-preference 에 맞게 reflection token prob 을 활용하여 decoding 을 조정할 수 있다.

▶ Experiments
실험 결과, reasoning 과 long-form generation 을 포함한 여섯개의 task 에 대하여, SELF-RAG 가 pre-trained and instruciton-tuned LLM 을 significantly outperform 한다. (Retreival-augmented ChatGPT 에는 4개의 task 를 앞서고, LLaMa2-chat 과 Alpaca 에 대해서는 모든 task 에서 앞선다.)

2. SELF-RAG: Learning to retrieve, generated and critique

앞서 말했듯, reflection token 의 도입을 통해, End2End LM 으로 하여금 generate, retrieve 그리고 criticize 를 하도록 한다.

2.1. Problem formalization and overview

Given input $x$ 에 대하여, $y=[y_1, …, y_T]$를 생성한다. 이 때, 각 $y_t$는 original vocab 에 추가적으로 reflection token 을 갖는다. reflection token 의 종류는 아래와 같다.

(1) Inference overview

Given $x$ 와 $y_{<t}$ 에 대하여, 모델은 retrieval 의 utility 를 평가할 retireval token 을 decode 한다. 여기서 두 가지로 나뉘는데, 우선 Retreival 이 필요하다고 판단될 경우, critique token 인 IS_REL (retireved passage 의 relevancy 를 평가), IS_SUP (생성된 Response 가 passage 에 supported 한지), IS_USE token 을 생성한다. 이후, 이 것들을 통해 multiple passage 를 rank 한다. Retrieval 이 필요없다고 판단될 경우, 다음 token 을 예측하고, IS_USE token 을 통해 criticize 한다.

(2) Training overview
Reflection token 을 vocab 에 추가한 뒤, 일반적인 next token prediction 을 통해 학습된다. 자세히 보면, generator LM 과 reflection token 을 학습하는데 이때 reflection token 을 critic model 에 의해 predict 된 것이다. Critic model 을 이용하여, training corpus 에 reflection token 을 삽입하여 update 한 뒤 training corpus 로 활용한다.

2.2. SELF-RAG Training

(1) TRAINING THE CRITIC MODEL

Data collection for critic model Manual annotation 이 비싸므로 GPT-4 를 활용하여 annotation 하면 좋다. 하지만 이러한 상업적인 LM 을 쓰는 것은 비싸고 재현성이 떨어지므로, 저자들은 GPT-4 를 prompting 하여 reflection token 을 몇 개 만든 후, in-house critic model 에 distll 한다.

GPT-4 에게 “Given an instruction, make a judgment on whether finding some external documents from the web helps to generate a better response.” 이런 식으로 prompting 한 후 reflection token 을 만든다. GPT-4 에게 시킨 이후, 인간이 평가하여 높은 점수를 얻은 것들을 추려 4k 개를 확보한다.

Critic learning

이후 4k 를 통해 crtiic model 에 distllation 학습을 한다.

아래 식을 통해 critic learning 을 진행한다.

본 연구에서는 generator LM 과 같은 Llama 2-7B 모델을 활용하였다.

(2) TRAINING THE GENERATOR MODEL

Data collection for generator

input-output pair $(x,y)$ 에서 각 output 은 여러 segment $y_t$ 로 이뤄진다. 각각의 $y_t$ 에 대하여, 학습된 critic model 이 reflection token 을 부여한다. 예를 들어, Retrieve token = Yes 가 부여되면, retriever 가 top-k passage 를 retrieve 해온다. 다시, 각 passage 에 대해 critic model 이 IS_REL 를 생성하여 passage 의 Relevant 여부를 결정한다. IS_REL = yes 라면, model generation 을 support 하는지의 IS_SUP 이 부여된다. 이후, critic model 은 생성된 generation $y_t$ 의 끝에 IS_USE token 을 부여한다.

Generator Learning

위의 자연스러운 next token prediction 을 통해 학습된다.

2.3. SELF-RAG INFERENCE

여러 경우에 Retrieval 이 필요하지 않은 경우도 있다. 예를 들어, esay 를 작성하거나하는 등의 open-ended task 에서는 retrieval 을 줄이는 것이 creativity 에 도움이 될 수 있다. 이에 SELF-RAG 는 adaptive 하게 Retrieval token 을 사용하거나 사용하지 않거나 할 수 있다.

추가적으로, Critique token 을 활용한 Tree-decoding 방법도 inference 시에 고려가능하다. (논문참조)

3. EXPERIMENTS

3.1. TASKS AND DATASETS

Closed-set Tasks : (1) Fact verficiation dataset : PubHealth (public health dataset) , (2) Multiple-choice reasoning dataset : ARC-Challenge (scientific exams)
Short-form generation tasks : PopQA, TriviqaQA-unfiltered
Long-form generation tasks : (1) biography generation task, (2) long-form QA task : ALCE-ASQA

3.2. BASELINES

Baselines without retrievals : LLaMA2-7B, LLaMA2-13B, Alpaca-7B, Alpaca-13B, ChatGPT, LLama2-chat-13B., COVE-65B
Baselines with retreivals : Rretrieval augmented LLama2, Rretrieval augmented ChatGPT

4. RESULTS and ANALYSIS

4.1. MAIN RESULTS

Comparison against baselines without retrieval

[Table2 Top]

SELF-RAG (bottom two rows) demonstrates a substantial performance advantage over supervised fine-tuned LLMs in all tasks and even outperforms ChatGPT in PubHealth, PopQA, biography generations, and ASQA (Rouge and MAUVE)
SELF-RAG also significantly outperforms a concurrent method that employs sophisticated prompt engineering

Comparison against baselines with retrieval

[Talbe2 Bottom]

SELF-RAG also outperforms existing RAG in many tasks, obtaining the best performance among non-proprietary LM-based models on all tasks

4.2. ANALYSIS

Ablation studies

[Figure3 (a)]

Effects of inference-time customization

[Figure3 (b)]

Efficiency and accuracy trade-off

[Figure3 (c)]

Effects of training data size

[Figure4 (a)]

Human evaluations

[Figure4 (b)]

Conclusion

This work introduces SELF-RAG, a new framework to enhance the quality and factuality of LLMs through retrieval on demand and self-reflection. SELF-RAG trains an LM to learn to retrieve, generate, and critique text passages and its own generation by predicting the next tokens from its original vocabulary as well as newly added special tokens, called reflection tokens. SELF-RAG further enables the tailoring of LM behaviors at test time by leveraging reflection tokens. Our holistic evaluations on six tasks using multiple metrics demonstrate that SELF-RAG significantly outperforms LLMs with more parameters or with conventional retrieval-augmented generation approaches.

초록색볼드체

초록색배경 빨간색배경

Yongil Kim

[ICLR2024] SELF-RAG: LEARNING TO RETRIEVE, GENERATE, AND CRITIQUE THROUGH SELF-REFLECTION