Nandan Thakur¹, Luiz Bonifacio^1,3, Xinyu Zhang¹, Odunayo Ogundepo¹, Ehsan Kamalloo¹, David Alfonso-Hermelo², Xiaoguang Li², Qun Liu², Boxing Chen², Mehdi Rezagholizadeh², Jimmy Lin¹
¹ David R. Cheriton School of Computer Science, University of Waterloo, Canada ² Huawei Noah’s Ark Lab ³ FEEC-Unicamp, Brazil

Abstract

(Lack of Evaluation of Multilingual LLM robustness) RAG 가 LLM external knowledge 에 를 leverage 하여 factual hallucination 을 경감하는데 큰 역할을 하지만, external retreived knowledge 속의 error 에 대한 robustness 에 대한 평가, 특히 영어 이외의 다른 언어 집단에서의 평가는 어렵다.
( NoMARICL ) 18개의 언어에 대하여, RAG 에 대한 LLM robustness 를 측정하는 NoMIRACL benchmark 를 제안한다. 이는 인간이 평가한 No-relevant passage 를 의도적으로 query 로 집어넣어 평가를 한다.
(LLM Robustness) 논문에서는 두 가지 측면에서 RAG 에 대한 LLM robustness 를 측정하는데 (1) hallucination rate 와 (2) error rate 이다.
(Experiment) GPT-4 가 영어나 프랑스 등의 higher-resource language 에서 더 hallucination 을 잘 일으키는 현상을 발견한다. 앞으로 non-relevant information 을 어떻게 잘 reject 하는 지에 대한 foundation 연구가 될 수 있다고 주장한다.

1. Introduction

▶ Challenging issue in RAG
Retrieval Augmented Generation (RAG) 는 reliable knowledge corpora 속의 정보를 LLM 에 잘 주입시키는 역할을 한다. 그러나 RAG 는 LLM 을 통한 강력한 generation 단에 비하여, retrieval system 은 relevant information 을 가져오는데 있어서 어려움을 보이는 경우가 있다. 특히, zero-shot domain 이나 low-resource language 에 대해서는 이런 retrieval 취약점이 더 잘 드러난다. 이러한 incorrect 하거나 non-relevant 한 information 은 LLM 으로 하여금 hallucination 을 일으키게 만든다. 하지만, 현재까지 low-resource 등 multilingual setting 에서 LLM reasoning capa 를 측정한 연구는 없었다.

▶ NoMIRACL
이 논문에서는 first-stage external information 에서의 error 에 저항하는 LLM robustness 를 측정하는 NoMIRACL benchmark 를 소개한다. 30 명의 native speaker 를 고용하여 dataset 을 구성하였다. NoMIRACL 은 두 개의 subset 으로 구성되어있으며, 각각 non-relevant, relevant 이다.

GPT-4 를 baseline 으로 하여 실험한 결과, GPT-4 가 non-relevant passage 에 대해 33.2% hallucination rate 을 보이는 것을 관찰하였고, relevant subset 에 대해서는 14.2% 의 비교적 낮은 error rate 을 보이는 것을 관찰하였다. 이를 통해 RAG 에서의 retrieval 단의 중요성을 확인할 수 있다.

추가적으로, GPT-4 hallucination rate 과 language resource size 에 positive correlation 이 있음을 통해, higher-resource (영어, 프랑스어등) 에서 더 많은 hallucination 이 있음을 관찰한다.

2. Background and Problem Identification

2.1. Retrieval-Augmented Generation (RAG)

RAG 는 factual correctness 를 위한 최신 연구에서 가장 핵심적인 기술이다. RAG 에 대해서 간단히 background 를 설명하면, first-stage 로 retriever 가 주어진 query 에 대한 top-k passage 를 retrieve 해온다. 이후, LLM 을 활용한 generation 단에서 second-stage로, query 와 retrieved top-k passage 를 활용하여 output 을 생성한다.

2.2. Robustness Evaluation

LLM 의 Robustness 평가는 위의 contingency table 로 간단하게 측정한다. Non-relevant passage 로부터는 “I don’t know”를, relevant passage 로부터는 “Yes, Answer is Present” 를 생성해야한다.

3. NoMIRACL Dataset

Data Construction Procedure

NoMIRACL 은 MIRACL dataset 을 기반으로 만들어진다. 첫 번째로, annotator (native language speaker) 가 prompt text 에 대해 well-formed query 를 생성하도록 요구된다. 각각의 prompt 는 language-specific Wikipedia 의 첫 100 단어 snippet 이다.

이후, hybrid multilingual retreival system 이 top-k passage 를 retireve 해온다. 이 것을 각각 annotator 들이 relevant, non-relevant 로 label 한다. 이 때, 모든 top-k passage 가 relevance 0 으로 label 되면 non-relevant subset 이 된다.

4. Experiemental Setup

Retriever system : BM25 + mDPR + mColBERT (mBERT trained with MS MARCO) LLM 에게 retrieved subset 을 보고 “I don’t know”, “Yes, answer is present” 중 하나를 고르게 하여, hallucination rate = FP/(FP+FN), error rate = TN/(TN+TP) 를 측정한다. prompt 는 아래와 같다.

5. Experimental Results

GPT-4 hallucination rate on the non-relevant subset

(1) GPT-4 는 33.2% hallucination rate 를 보여, all non-relevant passage 를 알아차리는 것이 쉽지 않다는 것을 보인다. (2) 프랑스어 스페인어 등 corpus size 가 큰 resource 에서 hallucination rate 이 큰 강한 상관관계를 확인한다.(Spearman 0.39)

GPT-4 error rate on the relevant subset

14.9% 정도의 낮은 error rate 을 보여, retrieve 가 잘 된 relevant subset 에 대하여는 LLM 이 잘 identify 할 수 있음을 보인다.

Conclusion

Retrieval-augmented generation setups are effective in the factual grounding of LLM-generated output with information available in external knowledge bases. However, in cases where retrieved information does not assist in providing the answer, a robust LLM should potentially identify them. In our work, we provide NoMIRACL for evaluating LLM robustness as a binary classification task on 18 languages. We provide two subsets in NoMIRACL, the non-relevant subset, where queries contain all non-relevant passages, and the relevant subset, where queries contain at least a single relevant passage. We evaluate robustness using two metrics: hallucination rate and error rate.

In our work, we build a GPT-4 baseline on NoMIRACL. GPT-4 achieves a high 33.2% hallucination rate on the non-relevant subset and 14.9% error rate on the relevant NoMIRACL split, highlighting that GPT-4 finds it challenging to dismiss non-relevant passages over relevant passages in first-stage retrieved information. We open-source NoMIRACL and hope it encourages future work to improve LLM robustness.

Yongil Kim

[Arxiv 2312] NoMIRACL: Knowing When You Don’t Know for Robust Multilingual Retrieval-Augmented Generation