Yongil's Research Blog

[EMNLP2023] Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

Wed, 17 Apr 2024 04:20:00 +0000

Ning Ding^1∗, Yulin Chen^2,3∗, Bokai Xu⁴, Yujia Qin^2,3, Shengding Hu^2,3, Zhiyuan Liu^2,3†, Maosong Sun^2,3†, Bowen Zhou^1†
¹ Department of Electronic Engineering, Tsinghua University, ² Department of Computer Science and Technology, Tsinghua University, ³ BNRIST, IAI, Tsinghua University, ⁴ The Chinese University of Hong Kong, Shenzhen

Abstract

(Fine-tuning chat language models) Fine-tuning a chat language model such as ChatGPT on instruction data is an effective way to improve performance, provided the data is sufficiently diverse and of high quality.
(UltraChat) The authors propose UltraChat, a large-scale dataset of instructional conversations that does not involve human queries. The dataset stands out in scale, average length, diversity, and coherence.
(UltraLM) Fine-tuning LLaMA on UltraChat yields UltraLM, which outperforms open-source models including WizardLM and Vicuna.

1. Introduction

▶ Chat LLM
Large language models (LLMs) show remarkable performance, and chat LLMs specialized for conversation, such as ChatGPT, have become enormously popular. Many open-source chat LMs have been released, but none of them beats even Vicuna, let alone ChatGPT or GPT-4 (as of May 2023).

▶ UltraChat
This work holds that the simplest approach can raise performance: Quality and diversity of training data play a vital role in further improving the performance of chat language model. In other words, higher-quality and more diverse data lead to better results. The authors release UltraChat, a million-scale multi-turn instructional conversation dataset. Rather than building conversations from tasks such as QA or summarization, they curate three sectors: Questions about the World, Creation and Writing, and Assistance on Existing Materials. To construct realistic multi-turn conversations, they then have two separate ChatGPT Turbo APIs converse with each other, generating the queries and the responses.

▶ Experiment
UltraLM is obtained by training the LLaMA-13B model on UltraChat. When evaluated by GPT-4, UltraLM records the highest score and outperforms all open-source models.

Instruction Tuning
After FLAN-T5 was trained on 60 NLP datasets and showed that an LM can acquire instruction-following ability through instruction tuning, many models were trained this way. T0 and InstructGPT are representative examples, and FLAN2022 showed that learning a wide variety of tasks improves out-of-distribution generalization. Since InstructGPT, there has also been much work on learning human preferences with reinforcement learning.

Data Augmentation with LLMs
Collecting large-scale human-annotated instructions is very difficult. For this reason, collecting data sampled from well-tuned LLMs such as ChatGPT or GPT-3.5 has attracted attention. For example, Self-Instruct and Alpaca distill Text-Davinci-003 to generate high-quality instruction-response pairs. The success of Alpaca spurred data augmentation with LLMs and gave rise to code-alpaca, alpacacot, GPT4ALL, ShareGPT, Dolly-v2, BELLE, Vicuna, Koala, Baize, and others. CAMEL simulates real human conversation through a multi-agent role-play environment.

3. Data Construction

The table below compares direct multi-turn dialog generation with ChatGPT against UltraChat.

Two key points that determine the quality of conversation data emerge. (1) An opening line determines the topic of the dialogue. (2) A user determines the plot of the dialogue, and the output should be tailored to the current topic with diverse language styles and requests.

Therefore, unlike the conventional way of collecting a comprehensive open-domain instructional chat dataset, the data collection schema must capture these interactions well in order to raise data quality. UltraChat covers conversation data designed under the following three schemas: (1) Questions about the World, (2) Creation and Writing, (3) Assistance on Existing Materials. Because diversity depends heavily on the opening lines, the proposed methodology focuses on obtaining a diverse set of opening lines and on how to prompt the user.

3.1. Questions about the World

This sector focuses on querying information about concepts, objects, and entities that exist in the real world. First, as in Table 2 below, ChatGPT is asked to suggest 30 concepts.

Each one is then expanded into 30 to 50 subtopics. Finally, 10 different questions are generated for each subtopic or concept, and ChatGPT is asked to produce 10 more questions based on each question.

The other way of choosing concepts is to use Wikidata entities. For the 10,000 most frequent entities, 5 meta-questions are generated per entity, and 10 specific questions and 20 extended questions are generated for each. After filtering, 500,000 (500K) questions are kept as opening lines.

3.2. Creation and Writing

The second sector is about generating new information conditioned on human input, such as writing an email, an essay, or a play. It draws on the creativity of the AI assistant.

First, 20 types of text material are chosen, as in Table 3 below.

ChatGPT is then used to generate diverse instructions, which are refined by ChatGPT again. These instructions serve as the opening lines for dialog generation.

3.3. Assistance on Existing Materials

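Text materials are collected from the C4 corpus; keywords are manually curated for a variety of content types, and the texts are categorized by matching URLs against the keywords. ChatGPT is then asked to generate five instructions for each of the 100,000 collected text materials, producing 500,000 pieces in total as opening lines.

When the user model is given only the dialog history, it tends to answer as if it were the AI assistant. This is very harmful for building multi-turn conversations, so the authors additionally assign a user personality. After the dialog data is generated, it goes through a filtering step.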

4. Data Analysis

4.1. Statistical Analysis

4.2. Human Assessment

※ See the paper for the detailed settings.

5. Experiments

The LLaMA-13B model is trained on UltraChat. Dialogs are simply split into shorter sequences that fit within 2048 tokens and trained with the standard LM loss. Training uses 128 A100 GPUs with a batch size of 512.
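The splitting step can be pictured with a minimal sketch. The summary does not say how dialogs are cut, so the greedy turn-level packing below, the whitespace tokenizer, and the function names `count_tokens` and `split_dialog` are all assumptions made for illustration, not the authors' implementation.

```python
def count_tokens(text):
    # Stand-in tokenizer (assumption): whitespace split.
    # A real setup would count tokens with the LLaMA tokenizer.
    return len(text.split())


def split_dialog(turns, max_tokens=2048):
    """Greedily pack consecutive turns into chunks of at most max_tokens tokens."""
    chunks, current, used = [], [], 0
    for turn in turns:
        n = count_tokens(turn)
        if n > max_tokens:
            # Truncate an over-long turn so it fits in a chunk by itself.
            turn = " ".join(turn.split()[:max_tokens])
            n = max_tokens
        if current and used + n > max_tokens:
            # The next turn would overflow the budget: close the current chunk.
            chunks.append(current)
            current, used = [], 0
        current.append(turn)
        used += n
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    dialog = ["a b c", "d e", "f g h i"]
    print(split_dialog(dialog, max_tokens=5))
```

Each resulting chunk would then be tokenized and trained with the usual next-token loss; whether the loss covers user turns as well is not stated in this summary.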

5.1. Experimental Setup

Baselines

Backbone : LLaMA, Pythia
Baseline : Alpaca, Vicuna, Koala, Dolly, OpenAssistant, WizardLM, ChatGPT, MPT, Baize

Datasets

Benchmark Evaluation : ARC-Challenge, HellaSwag, MMLU, TruthfulQA
Response Quality Evaluation : GPT-4, AlpacaEval, Evol-Instruct-test

5.2. Benchmark Evaluation

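Through pure instruction tuning on the UltraChat dataset, UltraLM improves substantially over LLaMA-13B and achieves SOTA on the four benchmarks. This indicates that UltraLM has a broad and deep understanding of world knowledge and commonsense knowledge.
The improvement is attributed to the UltraChat data construction process, which extends and deepens the discussion of world knowledge during dialog generation. Meanwhile, the comparatively weaker performance on MMLU points to a lack of domain-specific expertise and suggests that techniques for generating higher-quality data are needed to build specialized expert LMs.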

5.3. Response Quality Evaluation

Response Comparison

UltraLM outperforms every open-source model, with an impressive win rate of up to 98%.
It is also notable that UltraLM records a win rate 9% higher than Vicuna.

Independent Scoring

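Given the instability of pairwise comparison, independent quality scoring with GPT-4 is also performed.
UltraLM clearly outperforms all open-source models in overall score, and the breakdown offers insight into each model's performance on specific types of questions and instructions.
All models do better on simple questions involving commonsense knowledge and general world comprehension, while most models struggle with more complex tasks involving reasoning and creative writing.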

AlpacaEval

On the AlpacaEval leaderboard, which compares win rates against text-davinci-003, UltraLM ranks 4th.

Evol-Instruct Evaluation

UltraLM is compared with WizardLM on the Evol-Instruct-test dataset.
It shows a 29% improvement over all questions, which is an excellent result given that WizardLM was trained with Evol-Instruct.

Impact of System Prompts

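A system prompt is used to improve the response quality of UltraLM.
Such prompts do not greatly affect answer accuracy, but they substantially improve the overall quality of the generated output, mainly by making it more informative.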

6. Conclusion

In drawing to a close, our work introduces UltraChat, a structured design of multi-turn instructional conversation data primed to foster the growth of general chat models. UltraChat encapsulates a broad range of human-AI interactions, further developing a series of dialogues across various topics and instructions. Statistically, UltraChat shows an impressive presence in critical metrics such as scale, average length, diversity, and consistency, further establishing itself as a leading open-source dataset. We leverage UltraChat to fine-tune the LLaMA model, leading to the development of the robust conversational model, UltraLM. Evaluation across multiple benchmarks reveals that UltraLM surpasses previous open-source models like WizardLM, Vicuna, Alpaca, and Koala in performance.
We eagerly await the innovative research and development that will be catalyzed by our contributions in the field of AI conversational models.

[ICLR2024] #INSTAG: INSTRUCTION TAGGING FOR ANALYZING SUPERVISED FINE-TUNING OF LARGE LANGUAGE MODELS

Mon, 15 Apr 2024 04:20:00 +0000


Keming Lu∗& Hongyi Yuan∗& Zheng Yuan & Runji Lin & Junyang Lin & Chuanqi Tan & Chang Zhou & Jingren Zhou
Alibaba DAMO Academy

Abstract

(Lack of diversity in instruction-following data) LLMs can learn to follow instructions through supervised fine-tuning (SFT). This requires a good instruction-following dataset, but in terms of diversity and complexity such data is currently scarce and under-analyzed.
(INSTAG) The authors propose INSTAG, an open-set instruction tagging method. It assigns tags that capture the semantics and intentions of human instructions, which makes it possible to analyze instruction diversity and complexity quantitatively.
(Data sampling procedure) Having observed that the diverse and complex instructions identified by INSTAG benefit LLM training, the authors select 6K samples through a data sampling procedure.
(TAGLM) TAGLM, a model trained on the INSTAG-selected data, outperforms other open-source models on MT-Bench.

1. Introduction

▶ Fine-tuning LLMs
Fine-tuning aligns an LLM with human preference and lets it recognize human intention. Fine-tuning methods include supervised fine-tuning (SFT)([1-Alpaca],[2-Vicuna]), rejection sampling([3-RRHF],[4-PRO],[5-DPO]), and RLHF([6-RLHF],[7-InstructGPT],[8-LLama2]).

▶ SFT for LLMs
Among these, SFT for alignment is usually formatted in a multi-turn utterance manner, where each turn consists of a human query and a response well aligned with human preference. Such SFT datasets are usually collected from crowd-sourced data or by distilling from other LLMs.

Several recent studies argue that SFT training data for alignment should be diverse and complex, covering various domains, tasks, and semantics ([9-WizardLM],[10-Orca],[11-TULU]). This diversity and complexity are determined mainly by query formation. Many studies have proposed methods to increase the diversity and complexity of queries in order to improve SFT-aligned LLMs, but none has attempted to measure diversity and complexity quantitatively.

▶ INSTAG
To this end, the authors propose a tagging system that categorizes the samples in SFT datasets. Solving versatile tasks requires a versatile tagging system, but a manual fine-grained tagging system is too hard to apply to large-scale datasets. The authors therefore propose INSTAG, an automatic instruction tagging method that uses ChatGPT. With carefully designed ChatGPT prompts they build a systematic tagging system, apply INSTAG to SFT datasets, and verify that it tags human queries with their semantics and intentions well. Along the way they provide a detailed analysis that measures query distributions quantitatively in terms of diversity and complexity. As expected, the analysis shows that more diverse and more complex queries improve alignment performance through SFT. Based on this finding, they use INSTAG as a data selector, design a complexity-first diverse sampling method to collect data, and show that an LLM trained on this data performs well on MT-Bench.

▶ Contributions
The contributions of the paper are summarized as follows.

(1) INSTAG, an open-set fine-grained intention tagging method that serves as a metric of instruction diversity and complexity, is proposed.
(2) An analysis of query diversity and complexity provides insights.
(3) Good data is collected through INSTAG-based data selection and used to train the LLaMA-based TAGLM, which performs well on MT-Bench.

Data for Human Alignment
It has been highlighted that the performance of aligned LLMs is affected by the quality of the SFT data. This data quality may lie at the response level([11],[12]), or in task difficulty([13]), query complexity([14]), semantic diversity([15],[16]), and sample amount scale([17]).

Self-Instruct and Evol-Instruct are also methods for increasing diversity and complexity. Orca rewrites FLAN responses and queries with existing LLMs, which improves performance on NLP tasks. UltraChat generates multi-turn data through conversations with ChatGPT, seeded with manually designed, diverse anchor concepts and entities. OpenChat and Vicuna are chat LLMs that gained cutting-edge instruction-following ability by training on GPT-4 user logs from ShareGPT. OpenChat reports that using queries from ShareGPT user logs improves instruction-following ability. Lima shows that a small amount of high-quality data is enough to learn human alignment well.

Thus many studies use diverse and complex SFT data to teach human intention to LLMs, yet work that quantitatively measures and discusses the diversity and complexity of queries is still lacking. This work builds on the capability of ChatGPT to propose an automatic tagging system that quantitatively measures the diversity and complexity of training data.

3. INSTAG

3.1. OPEN-SET FINE-GRAINED TAGGING

Instructions used as prompts for recent chatbots are expressions of complex, multifaceted user intentions. As the ShareGPT example in Figure 1 above shows (Write flask routes for blog posts that implement CRUD. Use flask-sqlalchemy. The incoming and outgoing data should be in JSON. Use appropriate error handling and return status codes), user intentions are complex, so fine-grained tags are needed. Obtaining such fine-grained tags is hard, however, because both annotation and normalization are difficult. The authors therefore propose an automatic tagging system based on ChatGPT together with a normalization technique. The prompt below is given to ChatGPT, which tags queries via few-shot ICL.

3.2. TAG NORMALIZATION

The original raw tags that ChatGPT outputs in this way number 12,000, which shows that diverse fine-grained tags can be generated, but they are too noisy. For example, they can contain inconsistencies such as those in Table 1 below.

Lexical Noise arises from the instability of ChatGPT and is easily resolved with post-processing. Uncontrolled Granularity refers to tags that are too specific, and Spurious Correlation is caused by the bias of ChatGPT.

The authors therefore clean the raw tags with the following four normalization procedures, as in Figure 1 above.

Frequency Filtering : long-tail tags that appear fewer than $\alpha$ times are filtered out.
Rule Aggregation : to resolve lexical noise, post-processing is applied that lowercases all characters and replaces special characters with spaces.
Semantic Aggregation : a text embedding model such as PhraseBERT or DensePhrase is used to obtain tag semantics, and the DBSCAN algorithm clusters the tags so that each cluster is merged into a representative tag.
Association Aggregation : to resolve the atomic tag problem that arises mainly in mathematics and coding queries, the FP-Growth algorithm is applied to merge associated tags.
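The first two procedures are simple enough to sketch with the standard library. This is a rough illustration under assumptions: the function names `rule_aggregate` and `normalize_tags` are hypothetical, the regular expression treats anything outside `a-z0-9` as a special character, and rule aggregation is applied before counting so that lexical variants are counted together, which may differ from the order used in the paper. Semantic and association aggregation (embedding clustering and FP-Growth) are left out.

```python
import re
from collections import Counter


def rule_aggregate(tag):
    # Lowercase, replace special characters with spaces, collapse whitespace.
    return re.sub(r"[^a-z0-9]+", " ", tag.lower()).strip()


def normalize_tags(tagged_queries, alpha):
    """tagged_queries: one list of raw tags per query.

    Returns one sorted list of cleaned tags per query, keeping only tags
    that occur at least `alpha` times across the whole dataset.
    """
    cleaned = [[rule_aggregate(t) for t in tags] for tags in tagged_queries]
    counts = Counter(t for tags in cleaned for t in tags)
    # Frequency filtering: drop long-tail tags seen fewer than alpha times.
    return [sorted({t for t in tags if t and counts[t] >= alpha}) for tags in cleaned]


if __name__ == "__main__":
    raw = [["Code_Generation", "python"], ["code generation"], ["Rare-Tag"]]
    print(normalize_tags(raw, alpha=2))
```

With a threshold of 2, the two spellings of "code generation" are merged and kept, while the singleton tags are dropped.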

The authors apply INSTAG to 17 datasets, including ShareGPT, OpenChat, UltraChat, Alpaca, WizardLM, FLAN, Dolly, OAssist, Unnatural, Lima, Math Collections, and Code Collections. With $\alpha$ set to 20 and the remaining aggregation steps applied, 1,772 tags remain.

3.3. QUALITY EVALUATION

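Tagging quality is analyzed with GPT-4 and human annotators. Two metrics are used.

Precision : the agreement between a query and its tags.
Consistency : the agreement between a tag and randomly selected instructions that carry that tag.

The results are shown below.

※ See the paper for the detailed analysis of the results.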

3.4. PRELIMINARY ANALYSIS

The analysis of normalized tags on open-source datasets is shown in Figure 2.

Diversity : measures the range of semantics and intentions in the queries. The more distinct tags a dataset has, the more diverse it is considered.
Complexity : the more tags a single query is assigned, the more complex the query is. The average number of tags per query in a dataset therefore serves as the complexity measure.
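Both metrics follow directly from these definitions. The sketch below assumes each query is represented simply as a list of its normalized tags; the function names are illustrative and not taken from the paper's code.

```python
def diversity(tagged_queries):
    # Number of distinct tags that appear anywhere in the dataset.
    return len({t for tags in tagged_queries for t in tags})


def complexity(tagged_queries):
    # Average number of (distinct) tags assigned to a query.
    if not tagged_queries:
        return 0.0
    return sum(len(set(tags)) for tags in tagged_queries) / len(tagged_queries)


if __name__ == "__main__":
    data = [["math", "algebra"], ["math"], ["coding", "python", "debugging"]]
    print(diversity(data), complexity(data))
```

For the toy dataset above there are five distinct tags, and the queries carry 2, 1, and 3 tags, so the average is 2.0.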

The analysis yields the following four findings about diversity and complexity.

(1) Tag-based metrics present diversity and complexity well : WizardLM(Alpaca) complicates the queries of the Alpaca dataset with Evol-Instruct, and it shows high diversity and complexity.
(2) The larger size, the more diverse and more complex
(3) Math and Code show different trends : math datasets such as MATH and GSM8K and code generation datasets such as DMCC, MBPP, and DrRepair show low diversity and high complexity.
(4) Diverse and complex data induces higher performance : datasets in the upper-right corner, such as ShareGPT, UltraChat, and OpenChat-v1, lead to models near the top of the leaderboard when used for fine-tuning.

The correlation among open-source datasets is shown in the figure on the right. Two conclusions can be drawn.

(1) Tags can identify different tasks : math and code tasks show higher tag recall than other tasks. The tags give the math and code datasets their uniqueness relative to general-purpose datasets.
(2) Few cover all : WizardLM (Alpaca), WizardLM (ShareGPT), UltraChat, and ShareGPT have much higher tag recall than the other datasets. These are also the good datasets located in the upper right of the left figure.

Two outliers also stand out. One is Alpaca, which shows low performance and low complexity despite its large data size. The other is OpenChat-v1, which shows high complexity and high diversity despite its small scale, with only 8K multi-turn conversations left after filtering.

4. INSTAG FOR DATA SELECTION

4.1. EXPERIMENTAL SETUP

Data is collected by performing data selection with INSTAG. (Section 3 above applied INSTAG to open-source datasets for analysis; here a new dataset is assembled.)

Data Pool
Based on the analysis in the figure above, INSTAG is applied to WizardLM (Alpaca), WizardLM (ShareGPT), UltraChat, and ShareGPT. The resulting pooled dataset has 306,044 samples, a tag set of 6,398 tags, and an average tag number of 4.48.

Data Sampling
The 6K samples with the highest complexity are selected from the pooled dataset. These 6K samples have an average tag number of 16.56 and 100% tag coverage. The Complexity-first Diverse Sampling algorithm is as follows.
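One plausible reading of complexity-first diverse sampling can be sketched as follows. This is an interpretation rather than the paper's exact pseudocode: queries are visited from most to fewest tags, a query is taken only if it adds at least one tag not yet covered in the current pass, and passes repeat with a fresh tag set until `n` samples are chosen. The function name `cfd_sample` and the list-of-tag-lists input format are assumptions.

```python
def cfd_sample(tagged_queries, n):
    """Return indices of up to n queries, complexity first, diversity second."""
    # Visit queries with more tags first (stable for ties).
    remaining = sorted(
        range(len(tagged_queries)),
        key=lambda i: len(set(tagged_queries[i])),
        reverse=True,
    )
    selected = []
    while len(selected) < n and remaining:
        covered = set()   # tags covered in the current pass
        leftover = []
        for i in remaining:
            tags = set(tagged_queries[i])
            # Take the query only if it contributes a tag not yet covered.
            if len(selected) < n and not tags <= covered:
                selected.append(i)
                covered |= tags
            else:
                leftover.append(i)
        if len(leftover) == len(remaining):
            break  # no progress, e.g. only untagged queries are left
        remaining = leftover
    return selected


if __name__ == "__main__":
    pool = [["a"], ["a", "b", "c"], ["b"], ["d", "e"]]
    print(cfd_sample(pool, 3))
```

In the toy pool, the three-tag query and the two-tag query are taken in the first pass; the single-tag queries add nothing new, so one of them is picked up only in the second pass.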

Configuration
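The 6K samples are used to fine-tune LLaMA and LLaMA-2, and the resulting models are named TAGLM-13b-v1.0 and TAGLM-13b-v2.0 respectively. The batch size is 128 and the learning rate is 2e-5; during fine-tuning, query-response pairs are formatted with a Vicuna-style template.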

Baselines

Closed-source : GPT-4, GPT-3.5, Claude-V1
Open-source : Vicuna, WizardLM, Baize, OpenChat, Alpaca

4.2. RESULTS

TAGLM-13b-v1.0 outperforms all open-source LLMs even though it is fine-tuned on only 6K samples.

4.3. DECOUPLED ANALYSIS

First, to examine the effect of data size, data selection with the Cf-D algorithm is run at different data sizes and the results are measured. As the upper part of the table above shows, 6K is best; performance drops at 10K and 16K but is still better than the other open-source LLMs. This agrees with the finding in LIMA that training on small-scale but very high-quality data is what matters.

In addition, compared with the Random baseline below at the same 6K sample size, the score rises by 0.68, from 5.76 to 6.44, which shows that the complexity-first Cf-D sampling strategy is effective.

5. INSTAGGER: LOCAL TAGGER BY DISTILLATION

INSTAG relies on ChatGPT, so it is expensive for large-scale applications. The authors therefore release a distilled model called INSTAGGER. Based on LLaMA-2 7B, it scores 31.8% on EM-based F1 and 73.4% on semantic-based fuzzy match. EM is a rigorous metric because a tag must exactly match one of more than 6K tags, and 31.8% is still high under it. The fuzzy match, computed with PhraseBERT, counts a predicted tag as correct when its cosine similarity to a gold tag is at least 0.8, and reaches 73.4%.
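The difference between the two matching criteria can be illustrated with a small sketch. Everything concrete here is an assumption: the toy 2-d vectors stand in for PhraseBERT embeddings, and the set-level precision/recall F1 in `tag_f1` is only one possible way to score predicted tags against gold tags; the summary does not say how the paper aggregates the score.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tag_f1(pred, gold, is_match):
    """F1 over tag lists, where is_match(p, g) decides whether two tags agree."""
    if not pred or not gold:
        return 0.0
    precision = sum(any(is_match(p, g) for g in gold) for p in pred) / len(pred)
    recall = sum(any(is_match(p, g) for p in pred) for g in gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy embeddings (hypothetical); a real setup would embed tags with PhraseBERT.
EMB = {"python": (1.0, 0.0), "python programming": (0.9, 0.1), "cooking": (0.0, 1.0)}


def exact(p, g):
    return p == g


def fuzzy(p, g):
    return cosine(EMB[p], EMB[g]) >= 0.8


if __name__ == "__main__":
    pred, gold = ["python programming", "cooking"], ["python", "cooking"]
    print(tag_f1(pred, gold, exact), tag_f1(pred, gold, fuzzy))
```

Under exact match the near-synonym "python programming" earns no credit, while the fuzzy criterion accepts it, which is why the fuzzy score reported above is so much higher than the EM-based one.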

6. CONCLUSION

In this paper, we introduced INSTAG, an open-set tagging method leveraging the instruction-following ability of ChatGPT for SFT data analysis. We apply INSTAG on open-source SFT datasets, showing diverse and complex data leads to better alignment performance. We designed a complexity-first diverse sampling method to select 6K samples, and TAGLM fine-tuned on this selected dataset outperforms other open-source models aligned with considerably more data. Moreover, further decoupled analyses revealed that model performance increases with fine-tuning on more diverse and complex SFT data, respectively. In summary, our proposed INSTAG provides a novel aspect for a deeper understanding of query distribution in the alignment of LLMs. It has robust potential to be extended to more applications beyond the data selection shown in this work, such as creating comprehensive evaluations and tag-based self-instruct.

[Arxiv 2404]HyperCLOVA X Technical Report

Fri, 05 Apr 2024 11:40:00 +0000


NAVER Cloud
HyperCLOVA X Team

Abstract

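(HyperCLOVA X) The report introduces HyperCLOVA X, an LLM tailored to the Korean language and culture. It is trained on Korean, English, and code data.
(Evaluation) Experiments are run in both Korean and English on many benchmarks covering comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, and the model shows very strong reasoning ability in Korean.
(Multilingualism) Beyond its Korean-English bilingual nature, the model extends to multilingualism, suggesting applicability to other languages in tasks such as machine translation.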

1. Introduction

▶ Bias in English Corpus
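Many LLMs currently show very strong performance, but most are heavily biased toward North American and other English-speaking cultures, because their pretraining corpora are mostly English. As a result, for non-English languages such as Korean they fail to reflect specific cultural or geographical characteristics and do not perform as strongly.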

▶ HyperCLOVA X
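The authors therefore release the HyperCLOVA X family, which consists of the more powerful HCX-L and the lightweight HCX-S. Both models are tailored to the Korean language and culture, and they also perform well in a variety of languages besides English. The models are trained on an even mix of Korean, English, and code data.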

▶ Reasoning Capability
HyperCLOVA X models perform very well in Korean and English across nine task categories: reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness. In Korean in particular, they show comprehensive understanding that surpasses existing models, both closed-source and open-source.

▶ Multilingual Capability
In addition, in experiments on cross-lingual reasoning and on machine translation with three other languages frequently used in Korea, the models reach state-of-the-art machine translation performance. This impressive multilingual ability also extends to Korean-English cross-lingual transfer, an emergent ability in which instruction tuning in one language yields instruction-following ability in the other.

▶ Safety
To ensure safety, red teaming is used, and the safety data collection process is firmly grounded in the NAVER AI Ethics Principles. Safety is verified through a range of safety evaluations (automatic and human evaluation).

2. Training Details

Both HCX-L and HCX-S are pretrained on Korean, English, and code data, after which their instruction-following ability is improved through supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

2.1. Pretraining

HyperCLOVA X is an updated version of HyperCLOVA, built on a transformer decoder with a few modifications. To extend the context length, rotary position embeddings are used as the position embedding, and pre-normalization and grouped-query attention are adopted.

Data
The pretraining data consists of Korean, multilingual, and code segments. The multilingual segment is mostly English but also includes other languages such as Japanese, German, and French. To specialize the model in Korean, Korean data was secured so that it makes up one third of the total. As a result, the Korean, multilingual, and code data are equally distributed. For data quality, repetitive text, overly short text, and very low-quality documents are excluded, and personally identifiable information (PII) is removed. Knowledge-containing data is also upsampled to improve performance.

Tokenizer
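Preparing an effective tokenizer is important for a Korean-centric LLM. Korean is an agglutinative language in which words are formed by attaching grammatical morphemes to a root morpheme that carries the meaning. HyperCLOVA X trains a morpheme-aware byte-level BPE to tokenize Korean documents efficiently. The table below shows that it is markedly more efficient on Korean.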

Pretraining Scheme
Training is not limited to left-to-right; PSM and SPM training (i.e., fill-in-the-middle) is used as well. This scheme is designed to acquire in-filling ability during pretraining. 90% of training uses a context length of 4,096, and the remaining 10% uses a length of 32,768. Flash attention and 3D parallelism are used, with bf16 precision.
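To make the fill-in-the-middle idea concrete, the sketch below splits a document into prefix, middle, and suffix and rearranges the pieces so that the middle is generated last. The sentinel strings `<PRE>`, `<SUF>`, `<MID>` and the character-level split are placeholders chosen for illustration; the report does not specify its sentinels, and published SPM variants differ in exactly where the sentinels are placed.

```python
import random

# Hypothetical sentinel strings; real tokenizers reserve special tokens for these.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def split_document(doc, rng):
    # Pick two cut points at random and split into prefix / middle / suffix.
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    return doc[:i], doc[i:j], doc[j:]


def to_psm(prefix, middle, suffix):
    # Prefix-Suffix-Middle: the model sees prefix and suffix, then predicts the middle.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def to_spm(prefix, middle, suffix):
    # Suffix-Prefix-Middle (one simple ordering): suffix first, then prefix, then middle.
    return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"


if __name__ == "__main__":
    rng = random.Random(0)
    p, m, s = split_document("def add(a, b): return a + b", rng)
    print(to_psm(p, m, s))
    print(to_spm(p, m, s))
```

Because the rearranged string is still trained with ordinary next-token prediction, the model learns in-filling without any change to the architecture or loss.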

2.2. Alignment Learning

2.2.1. Supervised Fine-tuning (SFT)

Alignment learning through SFT maximizes the likelihood of the completion given each prompt. This improves abilities such as instruction-following, problem-solving, coding, and creative writing.

2.2.2. Reinforcement Learning from Human Feedback (RLHF)

It is now well known that a model aligned with SFT alone can produce uninformative or harmful content. To address this, RLHF typically teaches the 3H values: helpful, honest, and harmless. HyperCLOVA X uses Proximal Policy Optimization (PPO).

Reward Model.
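A randomly initialized linear head is attached to the post-SFT model so that it outputs a scalar reward. The model is trained with a ranking loss based on the Bradley-Terry model, which minimizes the negative log-likelihood of the reward difference between the chosen and rejected responses. The model is trained for only one epoch. (Following the InstructGPT paper.)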
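For a single preference pair, the Bradley-Terry ranking loss is commonly written as -log σ(r_chosen - r_rejected), where σ is the logistic sigmoid. A minimal numeric sketch follows; the function name is illustrative, and the rewards are plain floats standing in for the outputs of the reward head.

```python
import math


def bt_ranking_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    diff = r_chosen - r_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # For negative diff, rewrite to avoid overflow in exp(-diff).
    return -diff + math.log1p(math.exp(diff))


if __name__ == "__main__":
    print(bt_ranking_loss(2.0, 0.0))  # small loss: chosen is scored higher
    print(bt_ranking_loss(0.0, 2.0))  # large loss: ranking is inverted
```

The loss shrinks toward zero as the chosen response is scored further above the rejected one, and grows roughly linearly when the ranking is inverted.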

Reinforcement Learning
As in other models, PPO is used, and a KL penalty term([1],[2]) with a coefficient of 0.04 is added to the reward. The policy network is the post-SFT model, and the reward model is the one described above.
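A common way to apply such a penalty, sketched here under assumptions, is to subtract a sample-based KL estimate between the current policy and a frozen reference model from the reward model score. Using the post-SFT model as the reference and summing per-token log-probability differences follow standard RLHF practice and are not details confirmed by this summary; only the 0.04 coefficient comes from the text above.

```python
def kl_penalized_reward(rm_score, logp_policy, logp_ref, beta=0.04):
    """Reward model score minus beta times a sample-based KL estimate.

    logp_policy / logp_ref: per-token log-probabilities of the sampled response
    under the current policy and under the frozen reference model (assumed to
    be the post-SFT model).
    """
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate


if __name__ == "__main__":
    # The policy assigns higher probability to its own sample than the reference does,
    # so the estimate is positive and the reward is reduced.
    print(kl_penalized_reward(1.0, [-1.0, -2.0], [-1.5, -2.5]))
```

The penalty discourages the policy from drifting far from the reference distribution while it chases a higher reward model score.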

Many previous studies (AlpacaFarm, [3], [4]) report that output length increases after RLHF. The authors observe the same phenomenon and find that the model tends to favor longer sequences. To address it, they devise an iterative human feedback approach. They also add an early stopping mechanism to avoid overfitting to an instruction set restricted to particular lengths and formats.

Transformer-based LLMs are also prone to repetition.

The authors observe this problem as well and add sequence-level unlikelihood training to PPO, which resolves the repetition problem at minimal additional training cost.

PPO typically takes four times as long as SFT. To optimize this process, it is parallelized through asynchronous processing in a multi-node setting. In particular, continuous batching is employed to run inference on the four networks in the rollout phase of each iteration.

2.2.3. The Alignment Learning Pipeline

Instead of interrupting training at specific checkpoints, an event-driven pipeline detects checkpoint-saving events and evaluates the checkpoints asynchronously on separate computation resources, which keeps training efficient.

In addition, the SFT, RM, and PPO learning processes are each started automatically once the previous step finishes, minimizing human intervention.

3. Core Benchmarks

Benchmark Design.

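A major constraint in the development of multilingual language models is the lack of thorough evaluation frameworks for languages other than English. Competence in a particular language requires not only linguistic proficiency but also a deep understanding of the cultural and social nuances unique to its speakers. To evaluate the language ability of HyperCLOVA X, English and Korean benchmarks from both internal and external sources are used.

Because core competencies such as reasoning, world knowledge, and mathematics transcend language (they need not be language-specific), a large portion of these benchmarks is conducted in English to assess language-neutral skills. Meanwhile, to assess how well the model covers the various aspects of language-specific questions and how it handles cultural nuances, two detailed benchmark categories tailored to each language are used.

Furthermore, the Korean datasets are not machine-translated; they are either carefully crafted by experts or already widely recognized as such. These benchmarks include region-specific questions such as those in KoBigBench (KBB), a comprehensive Korean benchmark built through internal efforts, and the Korean-specific question set within KMMLU, so they rigorously assess how well the model understands Korean cultural and social context.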

Baselines.

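Because HyperCLOVA X is trained to be inherently proficient in both Korean and English, a direct comparison with a single counterpart is difficult. Comparisons of Korean proficiency are therefore made against Korean-specialized LLMs, and language-agnostic tasks are compared against general foundation models. For the Korean evaluation, the comparison targets are closed- and open-source LLMs that were further trained on a Korean corpus to adapt them to the target language, a practice prevalent in the Korean LLM community.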

Models Specializing in Korean : (1) Polyglot-Ko (TUNiB), (2) [SOLAR, SOLAR-chat (Upstage)](https://arxiv.org/pdf/2312.15166.pdf) (LLaMA2 architecture initialized with Mistral parameters), (3) LLaMA2 Ko, LLaMA2 KoEn (huggingface), (4) KORani (Krafton-ai), (5) EEVE-Korean-v (yanolja) (a SOLAR-based model that uses an efficient vocabulary for Korean)

General Foundation Models : (1) Falcon, (2) LLaMA2, (3) Mistral 7b

Evaluation Methods.
Two main evaluation methods are adopted.

(1) Open-ended question-answering free-form answer ( BigBench-Hard )

※ See the paper for the detailed settings.

(2) Closed-ended question-answering candidate answer

The overall results on all benchmarks are shown in the Figure and Table below. Each category is examined in turn.

3.1. Comprehensive Korean LLM Benchmarks

KoBigBench(KBB) : zero-shot
KMMLU : not a translation of MMLU but an MMLU-style benchmark that reflects Korean culture and language ; 5-shot
HAE-RAE Bench : Benchmark designed to challenge models in Korean cultural and linguistic knowledge ; it consists of four domains: vocabulary, history, general knowledge, and reading comprehension; zero-shot
Results

HCX is very strong in Korean.
This underscores the assertion that for language and region-specific Large Language Models (LLMs) to be successful, the acquisition of large-scale, high-quality data from the target group is crucial.

3.2. Comprehensive English LLM Benchmarks

MMLU (Massive Multi-task Language Understanding) : 5-shot
BBH (BigBench-Hard) : a collection of only the 23 hard tasks out of the roughly 200 tasks in BigBench, namely those on which SOTA models did not surpass human performance; 3-shot
AGIEval : human-centric standardized exams, such as college entrance and lawyer qualification exam ; zero-shot
Results

The results are in Table 4 above (the English columns on the right).

In English, performance is nearly on par with LLaMA2.
With CoT and self-consistency, the HCX score rises to 70.79, whereas LLaMA2 70B instead drops to 66.65.

3.3. Commonsense Reasoning

Hellaswag : a commonsense reasoning task that is easy for humans; 5-shot
Winogrande : cloze-style pronoun resolution problem ; 5-shot
PIQA : Physical Interaction Question Answering ; zero-shot
AI2 Reasoning (ARC) : grade-school level question-answers in two (easy and challenging) varieties; 25-shot
CommonsenseQA (CSQA) : 5-shot
Results

The models perform notably well on WinoGrande and CSQA. However, SOLAR and EEVE, which are further-trained versions of Mistral, do better on Hellaswag and PIQA.

3.4. World Knowledge and Factuality

Natural Question (NQ) : open-ended fact-seeking questions; one answer is selected from multiple candidate answers; 5-shot
TriviaQA : a large-scale reading comprehension benchmark of 600K Question-Evidence-Answer triplets; recently the evidence tends to be dropped and only question-answer pairs are used, in order to evaluate inherent knowledge ;
CLIcK : a brand-new benchmark that evaluates linguistic and cultural intelligence in the Korean language; zero-shot
Factscore : the prompts are slightly adapted to Korean Wikipedia;
Results

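NQ and TriviaQA were collected from a Western cultural background, so HyperCLOVA X performs poorly on them.
KORani and EEVE are further-trained from the English-based models Mistral and LLaMA2 respectively, so they handle these datasets well.
Conversely, LLaMA2 and the Polyglot LLM lack understanding of Korean culture, while HyperCLOVA X and EEVE-Korean-V1 do well.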

3.5. Mathematics

GSM8K : grade-school level math problems; 8-shot
MATH : 4-shot
Results

On GSM8K the model scores above 80, far ahead of the other LLMs.
On the harder MATH it also scores above 20, ahead of the other LLMs, most of which are below 15.

3.6. Coding Capabilities

HumanEval
MBPP
K-HumanEval : an in-house dataset of the Clova team ; the HumanEval dataset converted to Korean through machine translation and manual review
Results

The model leads on all datasets and metrics, and the margin is especially large on K-HumanEval.

3.7. Chat and Instruction-Following

MT-Bench : consists of multi-turn queries covering writing, extraction, stem, and coding, among others.
Ko-MT-Bench : MT-Bench is translated into Korean and then revised through internal review. For example, “Start every sentence with the letter A.” is manually changed to “모든 문장의 시작을 ‘하’로 해줘.” (Start every sentence with ‘하’.)

Note : LLM-as-a-judge

SuperNatural Instruction (SuperNI) : 119 tasks - 10 instances per task.
KoIF : a Korean instruction-following test set built internally at CLOVA; 32 tasks and 600 instances drawn from 18 datasets
Results

Apart from HyperCLOVA X and EEVE 10.8B, most open-source LLMs perform poorly on Ko-MT.
LLaMA2 exhibits language confusion, answering in English 98% of the time even when the question is in Korean, yet the judge LLM scores responses regardless of this mismatch.

3.8. Harmlessness

TruthfulQA : a benchmark of questions that are likely to be answered incorrectly because of common misconceptions and false beliefs; it can check whether a model answers incorrectly as a result of learning human-written text during pretraining; the multi-answer multiple-choice question set (mc2) is used
Bias in Open-Ended Language Generation (BOLD) : a benchmark that measures social bias in LM generation results; adopted by Gemma, the open version of Gemini;
Results

The larger the model, the higher its safety level.
※ A detailed analysis of harmlessness appears later in Section 5.

3.9. Comparison with Closed Source Models

The models are compared with three closed-source models: GPT-3.5, GPT-4, and the SOLAR API. SOLAR from Upstage has an open-source and a closed-source version, but the exact technical difference is unclear.

Results

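In Korean, the performance is overwhelmingly ahead. This was already demonstrated when the KMMLU dataset was released (February 2024).
In English, the model is claimed to be competitive(?) with GPT-4 (at 64.26 vs 53.51, the gap looks somewhat noticeable). For Korean-English bilingual users, the claim is that it can be used almost interchangeably with GPT-4, at 67.39 vs 67.06.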

Detailed results on HAE-RAE Bench

Performance is overwhelmingly strong in every area except General Knowledge (GK).

4. Multilinguality

HyperCLOVA X is trained on Korean, English, and code data, yet it supports many other languages. This section measures its multilinguality in terms of (1) cross-lingual reasoning, (2) machine translation, and (3) cross-lingual transfer.

4.1. Cross-Lingual Reasoning

Testing is done on Asian languages.

XNLI

It ranks first in every language except Chinese (second).

Cross-Lingual CommonsenseQA (X-CSQA)

Again, it ranks first in every language except Chinese (second).

4.2. Machine Translation

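FLORES+ : translation performance for English, Chinese, and Japanese (the languages most widely used in Korea); 1-shot; the metric is xCOMET (which correlates with human judgment better than other metrics).

Again, it ranks first in every language except Chinese (second).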

4.3. Cross-lingual Transfer

Cross-lingual transferability between English and Korean is evaluated. Instruction tuning is performed in one language, and instruction-following ability is then tested in the other.

LIMA
OpenOrca
Metric : ROUGE-L, LLM-as-a-judge (※ see the paper for details)

Impact of language ratio on instruction-tuning
The Korean ratio is increased from English only (1:0) to Korean only (0:1) to examine the effect of the language ratio in instruction tuning.

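As Figures (a) and (b) show, instruction tuning with even a small share of Korean data yields good performance in Korean (training on 0.5% Korean with the rest in English raised Rouge-L by 13 points over English only).
As Figures (c) and (d) show, some noise appears when the train and test distributions differ.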
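The language-ratio sweep can be pictured as building instruction-tuning sets with a fixed size and a varying Korean share. The sketch below is only an illustration of that setup: the function name `mix_by_ratio`, the fixed total size, and sampling without replacement are assumptions, not details from the report.

```python
import random


def mix_by_ratio(english, korean, korean_ratio, total, seed=0):
    """Sample a training set of `total` examples with the given share of Korean."""
    rng = random.Random(seed)
    n_ko = round(total * korean_ratio)
    n_en = total - n_ko
    mixed = rng.sample(korean, n_ko) + rng.sample(english, n_en)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    en = [("en", i) for i in range(1000)]
    ko = [("ko", i) for i in range(1000)]
    # 0.5% Korean, the rest English, as in the setting mentioned above.
    mixed = mix_by_ratio(en, ko, korean_ratio=0.005, total=200)
    print(sum(1 for lang, _ in mixed if lang == "ko"), "Korean examples out of", len(mixed))
```

Sweeping `korean_ratio` from 0.0 to 1.0 reproduces the English-only to Korean-only range described above.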

Cross-lingual instruction-following

Baseline models : HCX-S, HCX-L, Mistral7B, Yi-Ko-6B
Metric
Results

HCX has the highest overall average and maintains its linguistic performance regardless of the type and ratio of languages. (Mistral, when instruction-tuned in Korean, almost completely loses its ability (Ko->En).)

5. Safe and Responsible AI

5.1. HyperCLOVA X Ethics Principles

5.2. Red Teaming and Safety Data Collection

Beyond the Ethics Principles, there are hazardous topics such as “social issues and biases”, “illegal activities”, “sexual matters”, and “professional advice”. The authors therefore collect a variety of red teaming queries using attack methods such as harmlessness-helpfulness trade-off, role-playing, false premises, and jailbreak.

The data collection pipeline is shown in the figure above. An annotator devises an attack scenario and asks a question, and several HCX models respond. The annotators score these responses, for example with a harmlessness score and an overall score. The overall scores are used to construct ranking pairs for RLHF training.

If no response received a perfect score once scoring is done, the annotators write a new response themselves that is safe, helpful, and correct. This perfect response is used as SFT data.

5.3. Safety Evaluation

Evaluation is carried out in Korean and English; since most LLMs learn harmlessness through alignment tuning, open-source LLMs are used as baselines for comparison.

5.3.1. Toxicity

Toxicity is evaluated with the Perspective API.

RealToxicPrompts (RTP) - ENGLISH
Korean Offensive Language Dataset (KOLD)

Results

Bias Benchmark for Question Answering (BBQ)

Korean Bias Benchmark for Question Answering (KoBBQ)

5.3.3. Human Evaluation

Human evaluation is carried out using the Attack Success Rate (ASR) of red teamers and human preference.

HCX-S is safer than HCX-L, and the safety preference of HCX-S is on par with GPT-4 (the other safety results suggested that larger models are safer, so it is unclear why HCX-L loses here)

Attack Category

Conclusion

HyperCLOVA X represents a significant advancement in LLMs, particularly emphasizing the Korean language and culture while maintaining strong capabilities in English and other languages. Through a training process that incorporated a balanced mix of Korean, English, and programming languages, followed by supervised fine-tuning and reinforcement learning from human feedback, HyperCLOVA X demonstrates exceptional proficiency in a variety of tasks.

HyperCLOVA X’s performance across a wide range of benchmarks—e.g. reasoning in Korean and English, and problem-solving in coding and math—showcases its capacity and versatility. Also, its impressive multilingual ability, especially in cross-lingual reasoning and machine translation, further illustrates its generalization capability and the potential for broad application across different linguistic contexts.

Moreover, the commitment to responsible AI development and deployment is manifested through the extensive safety evaluations and adherence to ethical principles. HyperCLOVA X’s sophisticated handling of toxicity, social biases, and other ethical concerns through systematic red teaming and safety data collection processes, along with its performance in human evaluation studies, highlight its potential as a safe and reliable AI assistant. Overall, HyperCLOVA X sets a new standard for bilingual and multilingual LLMs, paving the way for more inclusive and culturally sensitive AI technologies.

As future work, we intend to explore multimodality, aiming to broaden HyperCLOVA X’s capabilities to seamlessly process and integrate diverse types of data, such as text, images, and audio. Moreover, we are set to explore the efficacy of model quantization techniques, with the goal of optimizing HyperCLOVA X’s inference without sacrificing its accuracy or the quality of the output. Additionally, we are actively researching the integration of external tools and APIs to augment the model’s functionalities. This will enable HyperCLOVA X to access specialized datasets and services, significantly enriching and enhancing the factuality of its responses. Our team is committed to integrating these innovative research topics with the existing and future services at NAVER and its subsidiaries as we strive to advance AI technologies that benefit humanity.

[ICLR2024] DP-OPT: MAKE LARGE LANGUAGE MODEL YOUR PRIVACY-PRESERVING PROMPT ENGINEER

Wed, 03 Apr 2024 07:30:00 +0000

[pdf] [github]

Junyuan Hong¹, Jiachen T. Wang², Chenhui Zhang³, Zhangheng Li¹, Bo Li⁴, Zhangyang Wang¹
¹ University of Texas at Austin, ² Princeton University, ³ MIT, ⁴ University of Chicago

Abstract

(Privacy issue in LLM) LLMs achieve strong performance on many tasks through prompt tuning. However, problems can arise when the prompt depends on sensitive personal information. One option is to host a local LLM and tune the prompt there, but this is impossible when the model is closed-source.
( DP-OPT ) This paper addresses the problem with Differentially-Private Offsite Prompt Tuning (DP-OPT). The method tunes the prompt on the client side and then applies the resulting discrete prompt to the cloud model. The authors show that the prompt transfers to the cloud model without sacrificing performance.
(Differentially-private (DP) ensemble) To guarantee that the prompt does not leak private information, the authors propose a private prompt generation mechanism based on a differentially-private (DP) ensemble.
(Experiment) With privacy-preserving prompts tuned on Vicuna-7B, DP-OPT achieves performance comparable to or better than non-private use of GPT3.5 or local private prompt tuning.

1. INTRODUCTION

▶ Prompt Engineering
Although Large Language Models (LLMs) achieve strong performance on a wide range of tasks thanks to powerful pre-training, prompt engineering offers a cost-efficient way to adapt them to downstream tasks. Instead of resource-heavy optimization of model parameters, prompt engineering only requires iteratively refining prompts, for example through API access. Manual prompt engineering has shown impressive performance on many tasks, but for specialized downstream tasks such as legal judgment, healthcare, and art, prompt design requires human experience grounded in domain knowledge. To address this, data-driven prompt tuning such as soft prompt tuning was devised: the prompt is represented as a trainable embedding vector, which is then refined using training instances.

▶ Data Privacy Issue
However, a major barrier to applying prompt tuning is data privacy. Entering privacy-sensitive information into a prompt for an LLM API such as ChatGPT causes problems, for example 1) Data Confidentiality (confidential data is sent as input) and 2) Information Leakage (information that must not be disclosed is leaked). If personal information such as names, addresses, and phone numbers is included in the pre-training or fine-tuning data, it can later be retrieved through the model's parameters.

A straightforward approach to this problem is to run the entire prompt process on a local device. However, for closed-source models such as the GPT series, local hosting is simply impossible, not to mention the substantial cost.

▶ Differentially-Private Offsite Prompt Tuning (DP-OPT)

To solve this problem, the authors propose Differentially-Private Offsite Prompt Tuning (DP-OPT). The method lets an LLM craft private and transferable prompts for a cloud-hosted LLM. As shown in the figure above, the crux of the privacy protection runs only on the client. Given a confidential dataset, DP-OPT lets a local LLM generate prompts from only a few samples. This local assistant LLM is much smaller than the cloud-based LLM. The prompt generation process is made private by a Differentially-Private (DP) ensemble of in-context learning. In experiments on several language tasks, prompts tuned on the open-source Vicuna-7B achieve strong performance when transferred to the closed-source GPT-3.5 or to Llama-2.

2. PRELIMINARIES

2.1. Large Language Models (LLMs) and Prompt Tuning.

LLMs such as GPT, Llama, and OPT generate the next token from the preceding context. Formally, they model the conditional probability $p_{LM}^t(y|x)$, where $x$ is the prompt, $y$ is the output, and $t$ is the temperature. Given a front-end prompt $\pi$ such as a task instruction, prompt tuning optimizes $\pi$ in $F(\pi,x)$, where $F$ combines $\pi$ with the input $x$, so as to improve $p_{LM}^t(y|F(\pi,x))$.

2.2. Differential Privacy

Differential Privacy is the de-facto gold standard for measuring the privacy guarantee of machine learning algorithms. Formally, for a space $X$, two datasets $D, D' \in \mathbb{N}^X$ are called adjacent if one can be obtained from the other by adding or removing a single data point (e.g. $D = D' \cup \{z\}$ for some $z \in X$).

What the DP definition means is that, for any pair of neighboring datasets, a DP algorithm must produce indistinguishable output distributions, which prevents an adversary from telling which dataset an output came from. In this work, the mechanism $M$ is the prompt generation algorithm.
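
For reference, the standard $(\epsilon, \delta)$-DP condition that this paragraph paraphrases can be written as follows. The symbols $\epsilon$, $\delta$, and the event set $E$ follow common textbook notation and are assumed here, since the summary does not spell them out.

```latex
\Pr[M(D) \in E] \;\le\; e^{\epsilon} \, \Pr[M(D') \in E] + \delta
\qquad \text{for all adjacent } D, D' \text{ and all } E \subseteq \mathrm{Range}(M)
```

Smaller $\epsilon$ and $\delta$ mean the output distributions on $D$ and $D'$ are harder to tell apart, which is the indistinguishability described above.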

3. METHOD

Assumptions
To exploit the strong performance of the cloud model while tuning prompts on a local client model, three assumptions are made.

1) Data Confidentiality : the client does not share its data with the cloud model.
2) Information Privacy : the tuned prompt does not leak private information.
3) Model Ownership : the parameters of the cloud model are not shared with the client.

Threat Model
The adversary is defined as a cloud vendor that wants to obtain private information. The adversary receives only the tuned prompt from the client and may attack it using any LLM. Several studies have shown that private information can be extracted from prompts.

Main Idea
To preserve data confidentiality and privacy, the authors propose Differentially-Private Offsite Prompt Tuning (DP-OPT), which decouples the data and prompt tuning from the cloud model. As in the earlier figure, 1) Private Prompt Engineering learns a prompt $\pi$ on a localized model, and 2) Prompt Transfer deploys the prompt to the cloud model for public inference.

This raises two major technical challenges.

(1) How to engineer a model-transferable prompt?
(2) How to guarantee that the prompts do not leak private information?

3.1. TRANSFERABLE DISCRETE PROMPTS ENABLE OFFSITE PROMPT TUNING

For a prompt to be transferable to a cloud model, it must be a discrete prompt that contains no model-specific embedding or tokenization strategy. Recent work reports that discrete prompts transfer naturally across domains. Wen et al. showed with their PEZ method that prompts tuned on GPT-2 755M can be used with the larger GPT-2 1.3B or with a different architecture such as OPT. However, this transfer causes a severe performance loss. The main reason for the poor transferability found in that work is the incoherence of the tuned prompts. This suggests that the method may still rely heavily on the embedding space to prompt the model, rather than producing something semantically meaningful.

The authors therefore look for semantically transferable prompts. To avoid the pitfalls of the embedding space, they search for fluent and coherent prompts instead of backpropagating in the embedding space. Inspired by Automatic Prompt Engineering (APE), they let the LLM itself act as the prompt engineer. The hope is that a well-trained LLM, taking the context and prompt samples as input, will generate prompts that are fluent, coherent, and (perhaps) transferable. In other words, discrete prompts crafted by one LLM may transfer to another with target-model-dependent performance on the same task.

Make LLM Prompt Engineer
For the best performance, the authors use Deep Language Network (DLN), a state-of-the-art APE method. DLN mimics gradient-based optimization and trains the prompt in a forward-backward manner. In the forward pass, the LLM makes predictions with the current prompt; in the backward pass, new candidate prompts $\pi$ are sampled from the LLM using those predictions as in-context examples. From the candidate prompt set, DLN-1 selects the best prompt, i.e., the one with the highest log-probability.
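
The final selection step can be sketched as below. This is only an illustration under stated assumptions: `select_best_prompt` and `toy_log_prob` are hypothetical names, and the toy scorer stands in for a real LLM that would return $\log p_{LM}(y \mid F(\pi, x))$. It is not the DLN implementation.

```python
import math
from typing import Callable, Sequence, Tuple


def select_best_prompt(
    candidates: Sequence[str],
    examples: Sequence[Tuple[str, str]],
    log_prob: Callable[[str, str, str], float],
) -> str:
    """Return the candidate whose total log-probability of the gold
    outputs over the (input, output) examples is highest."""
    def total(prompt: str) -> float:
        return sum(log_prob(prompt, x, y) for x, y in examples)
    return max(candidates, key=total)


def toy_log_prob(prompt: str, x: str, y: str) -> float:
    # Hypothetical stand-in for an LLM scorer: a prompt that names the
    # label word makes that label more likely.
    return math.log(0.9) if y in prompt else math.log(0.1)


candidates = ["Classify the review.", "Answer positive or negative."]
examples = [("great movie", "positive"), ("bad plot", "negative")]
best = select_best_prompt(candidates, examples, toy_log_prob)
```

With a real model, `log_prob` would be replaced by a call that scores the target tokens under the prompted model.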

LLM-Engineered Prompts Are Transferrable

Prompts were trained with DLN-1 on Vicuna-7B and then applied to Llama-2-70B, a larger model with a homogeneous architecture, and to the closed-source Davinci-003. As shown in the table above, prompts trained with DLN-1 deliver competitive performance on the target models, and even yield an 8% improvement on Davinci-003.

An example of a prompt actually generated by DLN-1 is shown below.

3.2. DIFFERENTIALLY-PRIVATE OFFSITE PROMPT TUNING (DP-OPT)

Private Prompt Generation

Private Selection among Generated Prompts

4. EXPERIMENTS

Tasks

Setup

4.1. PRIVATE OFFSITE PROMPT TUNING

4.2. ABLATION STUDIES

Examples of Privacy Leakage in Generated Prompts

DISCUSSION AND CONCLUSION

With the rising popularity of prompt tuning, our research endeavors to extend this tool to applications with heightened privacy concerns. We introduce the pioneering end-to-end system designed to derive differentially-private prompts from confidential training datasets and deploy these prompts on cloud models. Our approach is underpinned by theoretical validations of its privacy assurances, and through empirical analysis, we highlight the advantageous balance it strikes between utility and data privacy caused by the strong performance of scaled LLMs.

[ICLR2024] LOFTQ: LORA-FINE-TUNING-AWARE QUANTIZATION FOR LARGE LANGUAGE MODELS

Mon, 01 Apr 2024 04:30:00 +0000

[pdf] [github]

Yixiao Li^1∗, Yifan Yu^1∗, Chen Liang¹, Pengcheng He², Nikos Karampatziakis², Weizhu Chen², Tuo Zhao¹
^∗ Equal contribution, ¹ Li, Yu, Liang, and Zhao are affiliated with Georgia Institute of Technology. Correspondence to yixiaoli@gatech.edu, yyu429@gatech.edu, and tourzhao@gatech.edu., ² He, Karampatziakis, and Chen are affiliated with Microsoft Azure.

Abstract

(Quantization and LoRA) Quantization is an indispensable technique for working with LLMs, and it is now often studied together with LoRA fine-tuning. Prior work points out, as a limitation, that applying quantization and LoRA together leaves a consistent gap compared with full fine-tuning.
( LoftQ ) The authors therefore propose LoRA-Fine-Tuning-aware Quantization (LoftQ), which quantizes the LLM and at the same time finds a proper low-rank initialization for LoRA fine-tuning. The method reduces the discrepancy between the full-precision model and the quantized model and improves generalization on downstream tasks.
(Experiment) On NLU, QA, summarization, and NLG tasks, it outperforms existing quantization methods, and it is especially strong in the challenging 2-bit and 2/4-bit mixed-precision regimes.

1. Introduction

▶ LLM and their costs
Large Language Models (LLMs) achieve performance in natural language understanding (NLU) and natural language generation (NLG) far beyond that of non-LLM models. However, they demand extensive computational and memory costs. This not only makes training difficult but also requires a great deal of resources at deployment and inference time.

▶ Quantization and LoRA
To address these extensive requirements, quantization has been widely studied as a pivotal compression technique. Quantization converts high-precision numerical values into a discrete set of values. Models are usually stored in a 16-bit float format, so quantizing them to a 4-bit integer format reduces the storage overhead by 75%.

Low-Rank Adaptation (LoRA) is an important method for efficiently adapting a quantized pre-trained model to downstream tasks. It assumes that the difference between the fully fine-tuned weights and the pre-trained weights has a low-rank property, and represents that difference with low-rank matrices. As a result, the pre-trained weights stay frozen and only the low-rank matrices are trained, which enables effective task adaptation.

Previous work usually focused only on the quantization technique when quantizing a pre-trained model, ignoring the importance of the subsequent LoRA fine-tuning. For example, QLoRA inherits the fixup initialization used in LoRA and attaches zero-initialized low-rank adapters to the quantized pre-trained model. In that case, the approximation error introduced by quantization, which is especially pronounced in extreme low-bit situations such as the 2-bit regime, can adversely affect the initialization of LoRA fine-tuning. As in the left panel (a) of the figure below, QLoRA's quantized pre-trained model degrades severely below the 3-bit level. This deviation at initialization badly hurts fine-tuning performance. As in the right panel (b), fine-tuning performance with QLoRA drops sharply as the number of quantization bits decreases. It is noteworthy that QLoRA fails below the 3-bit level.

▶ LoftQ
The authors therefore propose LoRA-Fine-Tuning-aware Quantization (LoftQ). It targets pre-trained models that require both quantization and LoRA fine-tuning. The framework actively integrates low-rank approximation with quantization. As in the figure below, this synergy greatly reduces the discrepancy between the original pre-trained model and the quantized model. As a result, it provides an effective initialization point for the subsequent LoRA fine-tuning and leads to improvements on downstream tasks.

▶ Experiments
The authors apply the LoftQ framework extensively to NLU, QA, summarization, and NLG tasks. With 4-bit quantization, it gains 1.1 on XSum and 0.8 on CNN/DailyMail. LoftQ is particularly effective in low-bit scenarios: with 2-bit NormalFloat and 2-bit uniform quantization, it gains 8% on MNLI and 10% on SQuADv1.1.

2. Background

2.1. Transformer Models

Multi-head Attention (MHA) + Feed Forward Network (FFN)

※ See the paper.

2.2. Quantization

Quantization
N-bit quantization : Given a high-precision number $X^{HP} \in \mathbb{R}$, such as a 32-bit floating point number, it is converted to an N-bit integer $X^{INT} \in \{0, 1, \dots, 2^N - 1\}$ by $X^{INT} = \mathrm{round}((2^N - 1) F(X^{HP}))$.

$F(\cdot): \mathbb{R} \to [0,1]$ is a normalization function. Uniform quantization uses $F(X) = (X - X_{min})/(X_{max} - X_{min})$. QLoRA proposes 4-bit NormalFloat quantization (NF4), which assumes $X \sim \mathcal{N}(0, \sigma^2)$ and hence uses $F(X) = \Phi(X/\sigma)$, where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution.

Dequantization
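Using the lookup table $T$, where $T[i] = F^{-1}(i/(2^N - 1))$ for $i = 0, 1, \dots, 2^N - 1$, the integer $X^{INT}$ is converted to its high-precision counterpart $X^D \in \mathbb{R}$. Dequantization can therefore be written as $X^D = T[X^{INT}]$.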
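
A minimal NumPy sketch of N-bit uniform quantization and lookup-table dequantization as described above. The function names are illustrative, and the sketch assumes the input is not constant (so that $X_{max} > X_{min}$); it covers the uniform case only, not NF4.

```python
import numpy as np


def uniform_quantize(x: np.ndarray, n_bits: int):
    """Normalize x to [0, 1] with F, then round to an integer code in {0, ..., 2^N - 1}."""
    x_min, x_max = float(x.min()), float(x.max())
    levels = 2 ** n_bits - 1
    normalized = (x - x_min) / (x_max - x_min)              # F(X)
    codes = np.round(levels * normalized).astype(np.int64)  # X^INT
    # Lookup table T[i] = F^{-1}(i / (2^N - 1)) for the uniform F above.
    table = x_min + (x_max - x_min) * np.arange(levels + 1) / levels
    return codes, table


def dequantize(codes: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Map integer codes back to high-precision values: X^D = T[X^INT]."""
    return table[codes]


x = np.array([-1.0, -0.2, 0.1, 0.7, 1.0])
codes, table = uniform_quantize(x, n_bits=2)
x_d = dequantize(codes, table)
max_error = float(np.abs(x - x_d).max())
```

With 2 bits there are only four table entries, so every value snaps to one of four levels; the worst-case error is half the spacing between adjacent levels.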

Simulated Quantization for Matrices
Matrix multiplication can be performed directly on quantized representations, but it is common to use simulated quantization for matrices instead. There, quantized weight matrices are stored as encoded integers and are dequantized into simulated high-precision matrices when they are used. For simulated quantization, only the mapping from a high-precision matrix to a simulated high-precision matrix is needed.

2.3. Low-rank Adaptation

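Low-Rank Adaptation (LoRA) attaches small weight matrices $A$ and $B$ to the frozen pre-trained weight matrix $W$. The linear transformation $Y=XW$ is thus reformulated as $Y = XW + XAB^\top$, with the initialization $A \sim \mathcal{N}(0, \sigma^2)$ and $B = 0$.

This initialization of $A$ and $B$ keeps the model aligned with the pre-trained weights, since $AB^\top = 0$ at the start. During fine-tuning, $W$ stays fixed and only $A$ and $B$ are updated with an SGD-type optimization method.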

Importantly, if $A$ and $B$ are attached to a quantized backbone $Q = q_N(W)$, then $Q + AB^\top$ under the above initialization is no longer equal to the pre-trained weight $W$, which creates a discrepancy.
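
The following NumPy sketch illustrates both points: with the zero-initialized $B$, the LoRA forward pass reproduces $XW$ exactly, while swapping $W$ for a quantized backbone leaves a gap. The rounding-to-a-grid quantizer here is a crude stand-in for $q_N$ chosen for brevity, and all sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 8, 6, 2

W = rng.normal(size=(d1, d2))             # frozen pre-trained weight
X = rng.normal(size=(4, d1))              # a small batch of inputs
A = rng.normal(scale=0.02, size=(d1, r))  # A ~ N(0, sigma^2)
B = np.zeros((d2, r))                     # B = 0


def lora_forward(X, backbone, A, B):
    """Y = X @ backbone + X @ A @ B^T."""
    return X @ backbone + X @ A @ B.T


# At initialization the adapter contributes nothing, so Y equals X @ W.
same_at_init = bool(np.allclose(lora_forward(X, W, A, B), X @ W))

# Crude stand-in for q_N(W): snap every entry to a grid with step 0.5.
Q = np.round(W / 0.5) * 0.5
gap = float(np.linalg.norm(W - (Q + A @ B.T)))  # Frobenius norm, > 0
```

The non-zero `gap` is the initialization discrepancy that the next section sets out to reduce.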

3. METHOD : LoftQ (LoRA-Fine-Tuning-aware Quantization)

3.1. LoRA-Aware Quantization

An $N$-bit quantized weight $Q \in \mathbb{R}_N^{d_1 \times d_2}$ and low-rank approximations $A \in \mathbb{R}^{d_1 \times r}$, $B \in \mathbb{R}^{d_2 \times r}$ are used to approximate the original high-precision pre-trained weight $W \in \mathbb{R}^{d_1 \times d_2}$ as the initialization for LoRA fine-tuning. That is, before fine-tuning, the network is initialized by minimizing the objective $\min_{Q, A, B} || W - Q - AB^\top ||_F$.

Here $|| \cdot ||_F$ is the Frobenius norm. The objective jointly optimizes the initial values of the quantized backbone $Q$ and the low-rank adapters $A$, $B$, so it is designed with the subsequent LoRA fine-tuning in mind. Previous methods only cared about converting $W$ into $Q$ and ignored the subsequent LoRA fine-tuning, which leads to notable degradation.

3.2. Alternating optimization

The authors minimize the above objective by alternating between quantization and singular value decomposition (SVD).

Quantization
The quantization at step $t$ is $Q_t = q_N(W - A_{t-1} B_{t-1}^\top)$, with $A_0 = B_0 = 0$.

$q_N ( \cdot )$ can be any quantization function; following QLoRA, NF4 is applied.

SVD
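After the $t$-th quantization step, SVD is applied to the quantization residual $R_t = W - Q_t$: $R_t = \sum_{i=1}^{d} \sigma_{t,i} u_{t,i} v_{t,i}^\top$, where $d = \min(d_1, d_2)$.

The rank-$r$ approximation of $R_t$ then gives $A_t = [\sqrt{\sigma_{t,1}} u_{t,1}, \dots, \sqrt{\sigma_{t,r}} u_{t,r}]$ and $B_t = [\sqrt{\sigma_{t,1}} v_{t,1}, \dots, \sqrt{\sigma_{t,r}} v_{t,r}]$.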

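The procedure so far can be summarized as the algorithm below.

When $T=1$, $Q_1$ is exactly the quantized weight obtained by QLoRA, and $A_1$, $B_1$ come from the SVD of the residual $W - Q_1$. $T=1$ alone is already effective at reducing the quantization discrepancy, but the later experiments show that alternating optimization provides an initialization closer to the pre-trained weight $W$ and improves performance.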
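
The alternating procedure can be sketched in NumPy as follows. This is a hedged illustration, not the authors' code: `quantize_uniform` is a simulated uniform quantizer used as a stand-in for NF4, `loftq_init` is a hypothetical name, and the sketch assumes `steps >= 1` and `rank <= min(d1, d2)`.

```python
import numpy as np


def quantize_uniform(w: np.ndarray, n_bits: int) -> np.ndarray:
    """Simulated uniform quantization: quantize, then dequantize back to floats."""
    w_min, w_max = w.min(), w.max()
    levels = 2 ** n_bits - 1
    codes = np.round(levels * (w - w_min) / (w_max - w_min))
    return w_min + (w_max - w_min) * codes / levels


def loftq_init(W: np.ndarray, n_bits: int, rank: int, steps: int):
    """Alternate quantization and truncated SVD to approximate W by Q + A @ B^T."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    d1, d2 = W.shape
    A = np.zeros((d1, rank))  # A_0 = 0
    B = np.zeros((d2, rank))  # B_0 = 0
    for _ in range(steps):
        Q = quantize_uniform(W - A @ B.T, n_bits)          # Q_t
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)  # SVD of R_t
        sqrt_s = np.sqrt(S[:rank])
        A = U[:, :rank] * sqrt_s      # columns sqrt(sigma_i) * u_i
        B = Vt[:rank, :].T * sqrt_s   # columns sqrt(sigma_i) * v_i
    return Q, A, B


rng = np.random.default_rng(0)
W = rng.normal(size=(16, 12))
Q1, A1, B1 = loftq_init(W, n_bits=2, rank=4, steps=1)
err_quant_only = float(np.linalg.norm(W - Q1))
err_with_adapters = float(np.linalg.norm(W - Q1 - A1 @ B1.T))
```

At a single step, the adapter term is the best rank-$r$ approximation of the residual $W - Q_1$, so `err_with_adapters` cannot exceed `err_quant_only`; more steps can be tried by raising `steps`.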

3.3. Applying to LORA Fine-tuning

During LoRA fine-tuning, the integer weights are frozen and only the low-rank adapters are trained with AdamW. In the forward pass, the integer weights are dequantized through the lookup table. In the backward pass, gradients and optimizer states are computed only for the low-rank adapters $A$ and $B$.

4. Experiments

Quantization Methods

Uniform quantization
4-bit NF4 (Gaussian quantization)
2-bit NF2 (Gaussian quantization)

Baselines

Full fine-tuning
Full precision LoRA (LoRA)
QLoRA

4.1. Encoder-only Model : DeBERTaV3

Models and Datasets

Model : DeBERTaV3-base
Benchmark : GLUE (w/o WNLI), SQuADv1.1, ANLI

Implementation Details

Learning Rates : {1e-5, 5e-5, 1e-4, 5e-4}
Quantize Entire Backbone and quantize the embedding layer for higher compression efficiency

Main Results

Table1 reports results for NF2 and Table2 reports results for 2-bit uniform quantization.
LoftQ outperforms QLoRA for every rank, quantization method, and dataset.
On MNLI-m in Table2 it reaches 88.0% accuracy, beating QLoRA by 8%.
With NF2, it is close to full fine-tuning on SST and SQuAD.
At 2 bits, QLoRA fails on CoLA, whereas LoftQ reaches a high score of 60.5.

4.2. Encoder-Decoder Model : BART

Models and Datasets

Model : BART-large
Benchmark : Summarization task XSum, CNN/DailyMail
Metric : ROUGE 1/2/L

Main Results

LoftQ beats QLoRA with both NF4 and Uniform quantization at rank 8 and rank 16.
On XSum it even outperforms full precision; this is analyzed later.

With NF2 quantization, QLoRA fails to perform at all, while LoftQ performs well.

4.3. Decoder-only Model : LLaMA-2

Models and Datasets

Model : LLaMA-2-7b, LLaMA-2-13b
Benchmark : NLG task GSM8K, WikiText-2
Metric : Accuracy for GSM8K, perplexity for WikiText-2

Main Results

On WikiText-2, LoftQ outperforms QLoRA in every setting.
At 2 bits, QLoRA again fails to generate, while LoftQ reaches 7.85 ppl.
On GSM8K, QLoRA likewise fails to generate, while LoftQ reaches 26.5% acc.
The mixed-precision quantization scenario shows the potential of LoftQ.

4.4. Analysis

Effectiveness of Alternating Optimization
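The authors vary the number of alternating optimization steps $T$. As noted earlier, at $T=1$ the quantized weight $Q_1$ coincides with QLoRA's. For all tasks and models, even a minimal number of alternating steps yields a clear improvement, which indicates that the discrepancy between the quantized and original weights shrinks rapidly.

Interestingly, performance drops slightly when the number of alternating steps is too large (MNLI at $T=10$, and several values of $T$ on XSum). The authors attribute this to the fact that as the gap gets smaller, it becomes harder for further alternating steps to minimize it.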

5. Discussion

Start with quantization or SVD in the alternating optimization?
LoftQ's alternating optimization runs in the order quantization -> SVD. If the order is switched to SVD -> quantization, the results are still good, as shown below, but quantizing first, as in the original, performs slightly better.

LoftQ better than Full-precision LoRA?
In Table3 and Table5, LoftQ outperforms full-precision LoRA on XSum. The authors attribute this unexpected phenomenon to the fact that LoftQ's low-rank adapters have a non-zero initialization while full-precision LoRA uses a zero initialization, and they argue that the zero initialization makes fine-tuning unstable.

Conclusion

We propose LoftQ, a quantization framework for LLMs, which alternatively applies quantization and low-rank approximation to the original high-precision pre-trained weights, to obtain an initialization for the subsequent LoRA fine-tuning. Experiments on natural language understanding, question answering, summarization, and natural language generation show that our framework remarkably surpasses existing methods, e.g., QLoRA, for quantizing encoder-only, encoder-decoder, and decoder-only models. We have not observed our method exhibiting worse performance over QLoRA. Moreover, our quantization framework demonstrates effectiveness and robustness particularly in low-bit quantization regimes, e.g., the 2-bit level.

[Arxiv 2402] REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering

Fri, 29 Mar 2024 03:01:00 +0000

[pdf] [github]

Yuhao Wang^1∗, Ruiyang Ren^1∗, Junyi Li^1,3, Wayne Xin Zhao^1†, Jing Liu^4†, Ji-Rong Wen^1,2
¹ Gaoling School of Artificial Intelligence, Renmin University of China ² School of Information, Renmin University of China ³ DIRO, Université de Montréal ⁴ Baidu Inc.

Abstract

(RAG and Weakness) Retrieval-augmented generation (RAG) is actively studied as a way to overcome the limits of internal parametric knowledge. However, in RAG the LLM cannot accurately assess the relevance of retrieved documents, so RAG can actually lead to incorrect outputs.
( REAR ) To address this, the authors propose REAR, a RElevance-Aware Retrieval-augmented approach. It develops the LLM's self-awareness of source relevance and lets the RAG system make good use of it. Concretely, it uses a rank head.
(Experiment) On four open-domain QA (ODQA) tasks, it substantially outperforms previous RAG methods.

1. Introduction

▶ Retrieval-Augmented Generation (RAG)
Large Language Models (LLMs) perform well on many tasks, but they tend to struggle on knowledge-intensive tasks such as open-domain QA (ODQA). To help with this, Retrieval-augmented generation (RAG), which retrieves external knowledge to support generation, is being actively studied.

However, using RAG also has drawbacks. First, when the retrieved results contain irrelevant content, they can mislead the LLM into generating inaccurate answers ([1],[2]). In addition, when multiple documents are retrieved to improve performance, noise can affect the output ([3],[4]). LLMs therefore face the problem of filtering out irrelevant documents while avoiding noisy content.

▶ Enhancing Robustness of RAG system
Several recent studies (Self-RAG[5], SAIL[6], RobustLM[7]) try to improve the robustness of RAG. Self-RAG uses special tokens to discriminate whether a document is relevant and decides accordingly whether to use it during generation, while RobustLM prompts the LLM to discriminate whether a document is relevant before generating. However, because these approaches classify document relevance with a binary label, the signal is highly sparse and cannot capture fine-grained relevance.

▶ REAR: Relevance-Aware Retrieval-augmented Approach
The authors therefore propose REAR, a RElevance-Aware Retrieval-augmented approach. The core idea is to develop the LLM's self-awareness of source relevance so that the LLM can adaptively use external knowledge on its own.

REAR's main contributions lie in the model architecture and the training method. First, borrowing the reranker approach that has been successful in Information Retrieval (IR), a rank head is added to the LLM for relevance assessment; it captures relevance signals that help avoid distraction from irrelevant external knowledge. Second, to overcome the limits of the binary discriminative method, which gives only a coarse-grained signal, bi-granularity relevance fusion is applied to include fine-grained supervision, and noise-resistant training improves the discrimination ability.

In experiments, REAR performs well on four open-domain QA (ODQA) tasks.

2.1. Open-domain Question Answering

2.2. Retrieval-augmented LLMs

Jointly train LM and Retriever

Atlas
RA-DIT

Enhancing retrieval quality

Enhancing Robustness of RAG system

3. Task Formulation

[Task]

Open-domain Question Answering (ODQA)

[Notation]

query $q$
retrieved top-k document $D = [d_i]_{i=1}^k$
answer set $A = [a_i]_{i=1}^k = [LLM(q,d_i) | d_i \in D]$

[key aspect]

Identifying relevant references : precise evaluation of relevance between queries and documents
Reducing the influence of irrelevant content : leveraging relevance signal for noise-resistant generation

4. Methodology

4.1. Relevance-Aware RAG Architecture

The model architecture is shown in the figure below and consists of three parts: (1) Relevance Assessment, (2) Relevance-guided Generation, and (3) Final Answer Routing.

(1) Relevance Assessment
First, a reranker module assesses the relevance score of each query-document pair and reranks the documents. Concretely, a rank head is added to the LLM as a device for capturing the relevance signal between the query and the document.

The LLM decoder first takes the query-document pair as input and maps it to an embedding $v_{rel}$.

Then $v_{rel}$ is converted into a score $s_{rel}$ by RANKHEAD.

Here RANKHEAD is implemented as a linear projection layer. RANKHEAD is specialized for relevance assessment yet leaves the LLM's own modules untouched, and $s_{rel}$ can be optimized directly based on the LLM's internal state.

(2) Relevance-guided Generation
REAR trains the LLM to integrate the relevance assessment score. Since $s_{rel}$ is a scalar value, an additional embedding layer is used to pass it to the LLM.

This embedding vector, $v_{guide}$, serves as a cue for the LLM to generate the answer $a$, which is generated as follows.

Notably, $v_{rel}$ is not reused for generation; a separate $v_{guide}$ is learned instead. The authors explain that this lets $v_{rel}$ focus on relevance assessment and $v_{guide}$ focus on answer generation. The experiments later show that keeping the two distinct is effective.

(3) Final Answer Routing
For $k$ documents $D = [d_i]_{i=1}^k$, the equations above yield an answer set of $k$ answers, $A = [a_i]_{i=1}^k$. The authors view generation from multiple retrieved documents as navigating distinct reasoning paths, and pick the most reliable routing path to produce the final answer. Two routing strategies are used: path-reliability and knowledge-consistency.

Path-reliability routing

The most intuitive strategy: select the answer with the highest relevance score.

Knowledge-consistency routing

Inspired by self-consistency in Chain-of-Thought ([9],[10]), this strategy selects the answer based on the external knowledge that is most highly consistent with the internal knowledge.

First, as in the equation above, the proxy relevance $\hat{s}_{rel}$ is set to 0 and the inverse perplexity is computed. Because the relevance score is set to 0, the answer is generated based on internal knowledge only. If the perplexity of the answer is still low in this case, the external knowledge is consistent with the internal knowledge.

Accordingly, the documents whose relevance score exceeds a threshold $\gamma$ are taken as candidates, and the final answer is routed among them using the knowledge consistency score $c_i$.
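
The two routing strategies can be sketched as below. This is an illustration under assumptions: the `Candidate` fields, the function names, and the fallback to path-reliability when no document passes the threshold are all hypothetical, and the scores are made-up numbers rather than model outputs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    answer: str
    relevance: float          # s_rel for the (query, document) pair
    perplexity_no_rel: float  # perplexity of the answer with proxy relevance set to 0


def path_reliability_route(cands: List[Candidate]) -> str:
    """Pick the answer generated from the document with the highest relevance score."""
    return max(cands, key=lambda c: c.relevance).answer


def knowledge_consistency_route(cands: List[Candidate], gamma: float) -> str:
    """Among documents with relevance above gamma, pick the answer with the
    highest consistency score c_i = 1 / perplexity."""
    pool = [c for c in cands if c.relevance > gamma]
    if not pool:
        # Assumption: fall back to path-reliability when nothing passes gamma.
        return path_reliability_route(cands)
    return max(pool, key=lambda c: 1.0 / c.perplexity_no_rel).answer


cands = [
    Candidate("Paris", relevance=0.9, perplexity_no_rel=4.0),
    Candidate("Lyon", relevance=0.8, perplexity_no_rel=1.5),
    Candidate("Rome", relevance=0.2, perplexity_no_rel=1.1),
]
by_path = path_reliability_route(cands)
by_consistency = knowledge_consistency_route(cands, gamma=0.5)
```

In this toy input the two strategies disagree: the most relevant document wins under path-reliability, while knowledge-consistency prefers the above-threshold answer that the model finds least surprising on its own.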

4.2. Model Training

(1) Bi-granularity Relevance Fusion
Accurate relevance assessment is essential for reliable use of retrieved documents. Existing methods take a coarse-grained approach through a binary discrimination task, which (according to the authors) does not provide enough evidence to solve complex ODQA tasks. The authors therefore add a fine-grained ranking optimization objective to model training.

First, for coarse-grained supervision, the documents in $D$ are labeled as “irrelevant” or “relevant” and the model is optimized on these labels.

$\sigma$ is the normalized assessment probability that the LLM predicts for the query-document pair.

Then, for fine-grained supervision, ranking preference constraints are imposed using $s_{rel}$.

Finally, the bi-granularity relevance fusion used for model training is as follows.

(2) Noise-resistant Training

Beyond identifying relevant documents, the authors want to prevent irrelevant content from acting as noise and disturbing the LLM's generation. They define a negative document set $D^{-}$ and optimize the LLM with the following objective.

Finally, the overall loss of the REAR framework, which combines both training methods, is as follows.

(3) Training Data Construction
Obtaining high-quality training data requires acquiring relevance labels and acquiring negative samples.

First, to obtain supervision for $s_{rel}$, a cross-encoder is applied to each query-document pair to compute a relevance score $s_{ce}$. Combined with the traditional binary annotation label $y$, the generated score is given as below.

Then, for irrelevant document sampling, SimANS (Simple Ambiguous Negative Sampling) is refined so as to avoid negative samples that are too easy (uninformative) or too hard (false negatives).

5. Experiments

5.1. Experimental Setup

Datasets : Natural Questions(NQ), TriviaQA, SQuAD, WebQuestions

Baselines 1 (Retrieval augmentation based prompt methods) : Llama2-Chat, Mistral-Chat, Baichuan2-Chat, ChatGLM3

Direct Retrieval-Augmented QA, Judge-then-generate, Rank-then-generate

Baselines 2 (Specially designed RAG methods) : Self-RAG, RobustLM
Metrics : JAcc, Hit, Exact Match(EM), F1

5.2. Main Results

REAR is better at coarse-grained relevance judgment (higher JAcc) and at fine-grained ranking (higher Hit), and achieves the best generation performance (higher EM and F1).
It outperforms RobustLM trained on the same dataset.
Both RobustLM and REAR outperform Self-RAG, which validates that this dataset construction method is superior.
LLMs such as ChatGLM3 and Mistral improve with top-10 retrieval, which confirms that filtering out irrelevant documents is very important for improving RAG.

5.3. Detailed Analysis

(1) Ablation Study

(2) Impact of Retrieved Documents
Single document Setting

This is the single document setting, which uses only the top-1 retrieved document.
On relevant documents, both the fine-tuned Self-RAG and the fine-tuned REAR far outperform the 4-shot LLMs.
On irrelevant documents, however, only REAR performs well, showing that REAR is robust.

Multi document Setting

The multiple document setting is used to analyze the impact of the total number of documents and of their relevance degree.
The left figure analyzes the number of documents, and the right figure analyzes retriever capability.
REAR performs well even in the single document setting (the top retrieved one).
Even with the weakest retriever, such as BM25, REAR outperforms the baselines that use the best dense retriever.

6. Conclusion

In this paper, we aimed to enhance the self-awareness of source relevance in RAG systems, and proposed REAR, a RElevance-Aware Retrieval-augmented approach for open-domain question answering (QA). For model architecture, we explicitly incorporated a specially designed rank head to precisely capture the relevance signals, and employed it to guide the utilization of external knowledge. For model training, we designed an improved training method with bi-granularity relevance fusion and noise-resistant training, which enhance the capacities of fine-grained relevance assessment and adaptive use of retrieved documents. Extensive experiments on four datasets demonstrate the effectiveness of REAR’s relevance assessment and knowledge utilization.
As future work, we will extend the proposed approach REAR to dealing with more fine-grained source utilization (e.g., passage or sentence level augmentation), and also consider applying REAR to other knowledge-intensive tasks.

Limitations

For LLMs, the challenge of being misled by irrelevant retrieved documents is a significant obstacle, underscoring the crucial need for enhancing LLMs’ ability to adaptively utilize retrieved documents. In response to this issue, our work has concentrated on refining the architecture and training methods to bolster the effective use of retrieved documents by LLMs. We have implemented document-level relevance assessment and dynamic utilization strategies, significantly boosting the factual accuracy of generated content by LLMs. However, our current approach has not delved into guiding LLMs to focus more granularly on key sentences or tokens within the retrieved documents.
Moreover, the applicability of our methods across a broader spectrum of RAG tasks, such as those encompassed by the KILT benchmark, remains to be thoroughly evaluated. This gap presents a pivotal area for our future investigations.

[EMNLP2023] EtiCor: Corpus for Analyzing LLMs for Etiquettes

Wed, 27 Mar 2024 12:00:00 +0000

[pdf] [github]

Ashutosh Dwivedi, Pradhyumna Lavania, Ashutosh Modi
Indian Institute of Technology Kanpur (IIT Kanpur)

Abstract

(Etiquette) Etiquette is one of the most important elements of interaction, and it is region-specific.
( EtiCor ) This work proposes EtiCor, an Etiquettes Corpus. EtiCor is a corpus of social norms from five regions of the world. It can be used as a test bed for evaluating LLMs.
(Etiquette Sensitivity) In addition, the authors introduce a task called Etiquette Sensitivity and provide baselines with state-of-the-art LLMs such as Delphi, Falcon40B, and GPT-3.5. The results show that LLMs do not understand the etiquette of non-Western regions well.

1. Introduction

▶ Etiquettes
Etiquettes define the rules and conventions of social behavior. Etiquette is therefore an important element that carries regional implications. Some social norms are common worldwide, but most regions have society-specific norms, and these often conflict with the norms of other societies. When visiting another region or culture, one needs to take care not to act against its social norms.

In the digital age, people learn the norms of other societies with the help of web search or a PDA (personal digital assistant) rather than books. But do LLMs actually have social norms-specific information? Almost all LLMs are trained with a skew toward Western culture, and in particular on primary data sources such as Wikipedia. As the table below shows, Wikipedia page statistics indicate that most content is in English and that contributors are also mostly from Western societies such as North America and Europe.

▶ EtiCor (Etiquettes Corpus)
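To assess how well LLMs understand the culture and etiquette of other regions, and to check whether generative language models produce skewed generations that go against particular cultural norms, the authors introduce a new corpus, EtiCor. EtiCor covers social norms across several regions. The corpus is in English, but the authors plan to make it multi-lingual in the future.

As for why EtiCor is needed: first, it makes it possible to judge whether LLMs understand and generate other societies' norms well, and later it could become an essential resource for AI systems that reflect cultural differences. To this end, the authors propose a new task called Etiquette Sensitivity.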

2. EtiCor: Corpus for Etiquettes

To build EtiCor, the authors define etiquette by its dictionary definition and as a set of social norms. Etiquette is region-specific, dictates social and professional behavior, and is subjective. EtiCor covers five regions: East Asia (EA), India (IN), Middle East and Africa (MEA), North America and Europe (NA), and Latin America (LA). Examples from each region are shown in the table below.

EtiCor collects day-to-day data such as kitchen manners and food routines. It is organized around the following four main types.

1. Dining and Festivals, 2. Visits and Social Interactions, 3. Travel, and 4. Business. The distribution over these types is shown in the figure below.

EtiCor Creation
Information is collected from authentic, publicly available sources such as government-aided websites. These sources include regional etiquettes, tour guide points and pamphlets, etiquette information channels, and tweets and magazines on etiquettes. The collected information is then preprocessed and cleaned.

Labeling

label +1 : acceptable (positive) class, i.e. a general etiquette of the region
label -1 : non-acceptable (negative) class

3. Etiquette Sensitivity

Task Definition
The authors propose a task for testing whether LLMs understand region-specific societal etiquette. In this task, called Etiquette Sensitivity, the goal is to predict whether a given statement is appropriate in the given region.

Experiments

Model : Delphi (11B), Falcon-40B, GPT-3.5 Turbo
Metric : F1-score
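
As a rough illustration of the evaluation, here is a minimal sketch of an F1 computation over the +1 / -1 labels. Representing an abstention as `None` and excluding abstained examples from the F1 count are assumptions made for this sketch; the summary does not say how the paper handles them.

```python
from typing import Optional, Sequence, Tuple


def f1_with_abstentions(
    gold: Sequence[int], pred: Sequence[Optional[int]]
) -> Tuple[float, int]:
    """Return (F1 of the +1 class, number of abstentions).

    gold uses +1 (acceptable) and -1 (non-acceptable).
    pred may contain None when the model abstains (assumed encoding);
    abstained examples are skipped when counting TP / FP / FN.
    """
    tp = fp = fn = abstained = 0
    for g, p in zip(gold, pred):
        if p is None:
            abstained += 1
            continue
        if p == 1 and g == 1:
            tp += 1
        elif p == 1 and g == -1:
            fp += 1
        elif p == -1 and g == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return f1, abstained


# Toy example: five statements from one region.
gold = [1, 1, -1, -1, 1]
pred = [1, -1, -1, 1, None]
score, skipped = f1_with_abstentions(gold, pred)
print(score, skipped)
```

Reporting the abstention count next to F1 matters here, because a model that abstains often can look better on the examples it does answer.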

Results

As expected, North America-Europe (NE) scores much higher than the other regions and also has the fewest abstentions.
This confirms that LLMs are biased toward Western culture.
GPT-3.5 shows the worst performance, but it also abstains the least.

Wrong Predictions

The models do well on global etiquettes such as Travel and Business etiquette.
The rate of wrong predictions is high on region-specific etiquettes such as Dining and Visits etiquette.

Conclusion

In this paper, we presented EtiCor, a corpus of etiquettes covering major world regions. We further evaluated the performance of LLMs on the task of Etiquette Sensitivity and the results indicate a significant gap in knowledge and understanding of LLMs. In the future, we plan to develop region-specific Adapters and integrate them into an LLM via a mixture of expert

Limitations

In this paper, we proposed a new corpus and experimented on the task of Etiquette Sensitivity in a limited set of few LLMs. We do not develop any new model and leave it for future work. This resource paper aims to introduce the corpus and the task and show the limitations of LLMs when it comes to region-specific etiquettes. The work is a first step towards making more sophisticated etiquette-sensitive models.

[EMNLP2023] Uncertainty Guided Global Memory Improves Multi-Hop Question Answering

Mon, 25 Mar 2024 13:00:00 +0000

[pdf] [github]

Alsu Sagirova¹, Mikhail Burtsev²
¹ Moscow Institute of Physics and Technology, Dolgoprudny, Russia ² London Institute for Mathematical Sciences, London, UK

Abstract

(Multi-Hop QA) Two approaches are common in multi-hop QA: the first extracts multiple pieces of supporting evidence, and the second uses attention mechanisms to facilitate encoding of long inputs.
(Lack of global attention) The attention-based approaches lack explicit global contextual information that connects the reasoning steps.
( GEMFormer ) The authors propose GEMFormer, a two-step approach that (1) finds relevant information in the entire document and stores it in memory, and (2) combines it with the local context.
(Experiment) Fine-tuning a pre-trained model with the memory-augmented input improves over the baseline on three multi-hop QA datasets. The authors also confirm that the global explicit memory captures the supporting facts needed for a correct answer.

1. Introduction

▶ Multi-Hop Question Answering (MHQA)
With the progress of Transformers, multi-hop question answering, which requires several reasoning steps to extract the answer, has become an active research topic. Methods for MHQA fall into two broad groups. The first uses a sub-network or a dedicated module to extract supporting evidence. This approach depends entirely on the quality of evidence extraction, so the pre-selected facts become an upper limit on what the QA model can achieve. The second encodes long documents with attention patterns that increase the maximal input sequence length. The resulting attention-based token representations handle local and global information in the same vector, so high-level contextual features are spread over the long sequence and become hard to access.

▶ GEMFormer (Global Explicit Memory Transformer)
To address this problem, the authors propose GEMFormer (Global Explicit Memory Transformer), a method for augmenting a pre-trained language model with a memory that stores global information. The long input is concatenated with a memory sequence that holds the information important for solving the task. Token importance is defined by the uncertainty of the language model.

2. Global Explicit Memory

GEMFormer uses RoBERTa as its backbone. The global explicit memory is a sequence of the document tokens that matter most for correct reasoning and answer prediction. The uncertainty of the model is used as the measure of input importance. For an input sequence $x=[t_1,t_2,\dots,t_m]$ with $m$ tokens, the RoBERTa LM head yields a token probability vector $p=\mathrm{Softmax}(\mathrm{LM\text{-}RoBERTa}(x))$. The entropy $H= -\frac{1}{n} \sum_{j=1}^n p_j \log p_j$, where $j$ indexes the entries of $p$, is then computed for each input position. The study uses the two memory population conditions below.

$\theta$ is the threshold, and $k$ is the memory size.
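
The following is a minimal sketch of how the per-token entropy and the two memory population rules could be computed. The exact conditions are not reproduced in this summary, so the forms used here (keep tokens with $H < \theta$, or keep the $k$ tokens with the lowest entropy) are assumptions that follow from the description of $\theta$ and $k$. The function names and toy logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits: Sequence[float]) -> float:
    """H = -(1/n) * sum_j p_j log p_j over the n entries of p."""
    p = softmax(logits)
    return -sum(pj * math.log(pj) for pj in p if pj > 0.0) / len(p)


def memory_by_threshold(tokens, entropies, theta):
    """Assumed rule 1: keep tokens whose entropy is below theta."""
    return [t for t, h in zip(tokens, entropies) if h < theta]


def memory_top_k(tokens, entropies, k):
    """Assumed rule 2: keep the k lowest-entropy tokens in document order."""
    keep = sorted(range(len(tokens)), key=lambda i: entropies[i])[:k]
    return [tokens[i] for i in sorted(keep)]


# Toy example with a 3-entry "vocabulary" per position.
tokens = ["Paris", "the", "capital"]
logits = [[10.0, 0.0, 0.0], [1.0, 1.0, 1.0], [3.0, 0.0, 0.0]]
entropies = [token_entropy(row) for row in logits]
print(memory_by_threshold(tokens, entropies, theta=0.2))
print(memory_top_k(tokens, entropies, k=1))
```

A peaked distribution gives low entropy (the model is confident about the token), while a flat distribution gives the maximum value $\log(n)/n$ under this normalization.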

When a question and a context are fed to the model, an entropy is determined for each context token. This entropy is conditional: it depends on the question and on the context surrounding the token. If the document is a question-relevant collection, the entropy of task-relevant tokens should be lower than that of irrelevant ones.

The GEMFormer architecture is shown in the figure above. To fit RoBERTa's maximum sequence length limit, the contextual document is split into several segments, and the question is concatenated to each. Input processing consists of two stages: (1) document comprehension and memory population, and (2) task-prediction generation. In the first stage, each question-context segment is fed to the RoBERTa model and the LM head is used to compute the entropy. Tokens that satisfy the entropy condition in Eq. (1) above are then selected to form the Global Memory (GM). In the second stage, the question and the global memory tokens are concatenated with the input and used for MHQA task training.
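
The two stages can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: tokens are plain strings, `entropy_fn` stands in for the RoBERTa LM head, the selection rule is the assumed $H < \theta$ condition, and the order question + memory + segment in stage 2 is also an assumption.

```python
from typing import Callable, List

Tokens = List[str]


def split_into_segments(tokens: Tokens, segment_len: int) -> List[Tokens]:
    """Split the document so each piece fits the model's length limit."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def build_memory_augmented_inputs(
    question: Tokens,
    document: Tokens,
    segment_len: int,
    entropy_fn: Callable[[Tokens, Tokens], List[float]],
    theta: float,
) -> List[Tokens]:
    segments = split_into_segments(document, segment_len)

    # Stage 1: document comprehension and memory population.
    memory: Tokens = []
    for seg in segments:
        entropies = entropy_fn(question, seg)  # one value per segment token
        memory.extend(t for t, h in zip(seg, entropies) if h < theta)

    # Stage 2: every segment sees the same global memory.
    return [question + memory + seg for seg in segments]


def toy_entropy(question: Tokens, segment: Tokens) -> List[float]:
    """Hypothetical stand-in: capitalized tokens get low entropy."""
    return [0.1 if tok[0].isupper() else 0.9 for tok in segment]


question = ["who", "wrote", "Hamlet", "?"]
document = ["Hamlet", "is", "a", "play", "by", "Shakespeare", "."]
inputs = build_memory_augmented_inputs(question, document, 4, toy_entropy, theta=0.5)
for row in inputs:
    print(row)
```

The point of the design is visible in the second segment: it contains no mention of "Hamlet", but the memory carries that token over from the first segment.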

Experiments are run on three English MHQA datasets: HotpotQA, 2WikiMultiHopQA, and MuSiQue-Ans, referred to as HP, 2W, and MSQ respectively.

3. Results and Discussion

Main Results

Performance is good when low-entropy tokens are used as the Global Memory.

Improving ChatGPT performance

Compared with the question-only setting, retrieved passages are needed for good performance.
Adding Global Memory to retrieved passages actually hurts performance (Q+R > Q+M+R), but with the full context, using Global Memory is better (Q+M+C > Q+C).

Ablation Study

The question is indispensable for filling the memory, fine-tuning helps, and random memory is not effective.

As the figure above shows, the entropy of supporting facts decreases during training, which confirms that the low-entropy rule is a better choice than random memory.

Memory Analysis

The larger the memory size, the better the performance.

Conclusion

In this study, we demonstrated how utilizing uncertainty-based global explicit memory can enhance the model performance on MHQA tasks. Our findings indicate that utilizing low entropy context tokens can aid the model in MHQA reasoning, but only when the entropy estimation model is specifically fine-tuned to the target task. Experiments show that higher-performing models use larger memory sizes with better coverage of supporting facts.

Limitations

There are several limitations to this work. First, the global explicit memory augmentation of the input sequence may increase the training time by shortening the context chunk lengths. Second, the current implementation of memory token selection results in storing a significant fraction of irrelevant tokens which interferes with the calculation of correct predictions. We will work on methods to improve the relevance of information stored in memory

[EMNLP2021 best paper] Visually Grounded Reasoning across Languages and Cultures

Fri, 22 Mar 2024 04:32:00 +0000

[pdf] [blog] [github]

Fangyu Liu^*1, Emanuele Bugliarello^*2, Edoardo Maria Ponti^3,4, Siva Reddy^3,4, Nigel Collier¹, Desmond Elliott²
¹ University of Cambridge, ² University of Copenhagen, ³ Mila - Quebec AI institute, ⁴ McGill University

Abstract

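(Motivation) Datasets and encoders built on ImageNet are mostly English-centric, and most of their material comes from North America or Western Europe.
(Dataset) To address this, the authors introduce a new protocol and build a new multilingual dataset, MaRVL, covering a diverse set of languages: Indonesian, Chinese, Swahili, Tamil, and Turkish.
The dataset collects statements about image pairs that reflect the local culture, and the authors evaluate several cross-lingual transfer settings on it.
The results point to new challenges and opportunities for building multilingual and multicultural systems.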

Introduction

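ImageNet laid the foundation for computer vision research. The dataset is built on a concept hierarchy selected from WordNet. Other datasets such as NLVR2, MS-COCO, and Visual Genome were built on top of it, as were pretrained encoders such as ResNet that are used to preprocess visual data. How well do the concepts and images in ImageNet suit settings beyond the English-speaking, North American and European culture in which it was created? Defining an ideal distribution is hard and may depend on the purpose. If the goal is global representativeness, however, there is evidence that both the origin and the content of the data are biased. To address this, Yang et al. proposed intervening on the data by filtering and rebalancing some categories. This still falls short as long as the coverage of the original distribution does not span diverse languages and cultures. Extending the global reach of multimodal technology therefore requires a more fundamental overhaul of the hierarchy. Indeed, the most salient concepts, their prototypical members, and their visual appearance can vary with cultural and environmental factors. Such variation can be blurred by common dataset-creation practices, such as (randomly) selecting concepts from language-specific resources or automatically collecting images from web queries.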

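In this work, the authors revise the existing protocol to mitigate these biases when creating a multicultural and multilingual dataset. In particular, the selection of concepts and images is left to native speakers. The study focuses on a diverse set of languages and cultures (Indonesian, Swahili, Tamil, Turkish, and Chinese) and collects grounded descriptions by asking native speakers to write statements that compare image pairs. The chosen task requires deep linguistic understanding and integration of information across modalities rather than simple matching. Examples from the resulting dataset, 'Multicultural Reasoning over Vision and Language (MaRVL)', are shown in the figure above.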

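The authors evaluate recent vision-and-language models (Liu et al., 2019; Chen et al., 2020) on MaRVL using zero-shot and translation-based cross-lingual transfer, and find that performance drops sharply compared with the English dataset (NLVR2; Suhr et al., 2019). An analysis of the failures shows that MaRVL is hard because of the diversity of its images and languages and the domain shift in its concepts. MaRVL can therefore give a more reliable estimate of the generalization ability of current models than existing benchmarks. The dataset, annotation guidelines, code, and models are available at marvl-challenge.github.io.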

Motivation

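ILSVRC 1K (ImageNet Large-Scale Visual Recognition Challenge) is an important benchmark in computer vision and is based on 1,000 concepts extracted from ImageNet. Whether such a dataset can represent diverse languages and cultures has been questioned, which makes it necessary to define what a concept is more precisely.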

Concepts: Basic Level and Prototypes

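The authors define a concept as the mental representation of a category (e.g. BIRD), in which instances of objects and events with similar properties are grouped together. Not all members of a category have equal status, however: some are more prototypical than others (e.g. PENGUINS are more atypical BIRDS than ROBINS). What counts as prototypical, which categories are basic-level, and how many categories a domain has are all constrained by cognitive, cultural, and environmental factors and by individual preference.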

Limitations of ImageNet

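The original ImageNet annotation was not meant to verify whether concepts are universal and basic-level. Even so, this design choice can turn out to be a serious limitation for building multimodal systems that reason about everyday scenarios across many languages and cultures.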

ImageNet concepts are not universal.

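ImageNet was built on the English WordNet. As a result, it contains concepts that are familiar in the English-speaking world but unfamiliar or entirely unknown in other cultures, and it may miss concepts from other cultures. To measure how relevant ImageNet concepts are across languages, the authors map each synset to its Wikipedia page and extract the languages in which that page is available. Most synsets exist in 30 languages or fewer, and very few concepts are "universal". Using the WALS database, the authors show that the same holds for language families and that most of the covered languages come from Eurasia.

ImageNet concepts are overly specific to English.

ImageNet includes overly specific concepts such as BLENHEIM SPANIEL, which belong to the leaf nodes of WordNet and are more specific than a basic-level concept such as DOG. Comparing the WordNet depth of the terms people use to label images (Ordonez et al., 2013) with the depth of ImageNet concepts confirms that ImageNet favors finer-grained synsets. The problem is not limited to English and can be worse in other cultures. For the Japanese instrument koto, the authors note that English speakers would simply say 'musical instrument', whereas Japanese speakers are expected to use the more precise term '箏' (koto).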

Sources of Bias

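This section examines the potential sources of the biases discussed above by looking separately at each step of dataset creation for ImageNet, ILSVRC 1K, and NLVR2: 1) concept selection, 2) candidate image retrieval, and 3) manual cleanup. The first source of bias at design time is concept selection. ImageNet selected 12 subtrees and 5,247 synsets in total from WordNet, favoring finer-grained synsets in order to obtain a "densely populated semantic hierarchy". From these, 1,000 concepts were randomly chosen for the ILSVRC 2012-2017 shared tasks. These 1,000 concepts are therefore likely to be biased toward non-basic levels (e.g. 147 of the synsets are dog breeds).

The second source of bias is candidate image retrieval. Images obtained from search engines (Flickr and other unspecified engines for ILSVRC 1K, Google Images for NLVR2) do not follow the real-world distribution with respect to gender (Kay et al., 2015), race (Noble, 2018), and similar attributes. Search engines also customize results according to the user's profile and region. ImageNet queries were again written in English, with some (unspecified) queries in Spanish, Dutch, Italian, and Chinese (Mandarin); only the last of these is not a Western European language.

Third, the authors note that image filtering can introduce further bias. Filtering is needed because only about 10% of the search results are of adequate quality. In ImageNet, the cleanup was done through Amazon Mechanical Turk. Nothing is known about the languages and cultures of the annotators, and there is no evidence that they represent global diversity. Images without annotator consensus were also removed, which may erase cultural differences. (Note that disagreement may simply reflect a different basic level or prototype.)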

MaRVL : Dataset Annotation

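Given the biases inherent in ImageNet, the authors define a protocol for collecting data made of concepts that arise from the lived experience of native speakers. Dataset creation consists of five steps: 1) selection of languages; 2) selection of universal concepts; 3) selection of language-specific concepts; 4) selection of images; 5) annotation of captions.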

Selection of Languages

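The study selects a diverse set of languages (Indonesian, Swahili, Tamil, Turkish, and Simplified Chinese) and builds a dataset of words that are common in each language community and frequent in its particular cultural and geographic setting. The aim is to reflect a linguistically and culturally diverse world and to make the dataset more broadly usable.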

Selection of Universal Concepts

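To build a dataset that spans many languages and cultures, the authors start from words that exist across the world's languages. Anthropological studies and comparative linguistics provide lists of such universal concepts. From these, the authors choose the Intercontinental Dictionary Series and use the words covering concrete objects and events in 18 selected semantic fields as a shared pool.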

Selection of Language-Specific Concepts

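For each language, five native-speaker annotators are hired and asked to provide Wikipedia page links for 5-10 specific concepts from their culture in each semantic field. The concepts in each field must be commonly seen or representative among the population speaking that language, and should be "ideally physical and concrete". This yields 86-96 specific concepts per language. High agreement among annotators suggests that the selected concepts are representative of the culture.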

Selection of Images

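Two native annotators per language are hired to select images for the concepts. Following the image selection requirements of NLVR2, they must choose images that contain more than one instance of the concept, images in which the concept interacts with other objects, images in which the concept performs an activity, and images that display a variety of objects or features. Images must be natural images with a CC license and must be the kind of images that speakers of the language would see in everyday life. To achieve this, annotators gather images from a variety of sources. Concepts with fewer than 8 valid images are discarded.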

Annotation of Captions

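For each concept, eight images are selected at random and grouped into four pairs. For each set of four pairs, an annotator writes a description that is true for two of the pairs and false for the other two, and the description has to revolve around the "Theme Concept". Validators then assign a True/False label to each description and check for typos and grammatical errors. When labels disagree, the original annotator re-examines the example, and a native speaker performs a final check. Each example in the resulting dataset consists of two images, a description, and a True/False label.

Dataset Analysis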

Human Validation

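The authors run a final round of evaluation to report human accuracy and inter-annotator agreement (without changing the finalized captions). For each language, 200 examples are sampled at random from the dataset. The True/False labels are hidden, and two new raters are asked to re-label the examples (same as Fig. 3, right). Across all languages, the kappa among the three annotators (the caption writer and the two final-round raters) is at least 0.887 (Tab. 1). According to Landis and Koch (1977), this indicates almost perfect inter-annotator agreement. Taking the caption writer's labels as correct, average human accuracy is mostly in the high 90s, with Swahili (93.0%) as the exception, which is still very high.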

Concept and image statistics.

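Detailed dataset statistics are given in Tab. 2. After image collection, five concepts per language were filtered out on average. Some of the final concepts are absent from the English WordNet, for example sports such as yağlı güreş (OIL WRESTLING), buildings such as 四合院 (SIHEYUAN), and foods such as தோசை (DOSA).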

Caption statistics.

Tab. 3 shows the main statistics of the MaRVL captions together with those of 250 randomly sampled NLVR2 captions.

Image distribution.

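To understand the distribution of MaRVL images and how it differs from NLVR2, the authors extract features of (1) MaRVL images and (2) 1,000 NLVR2 images with an ImageNet-pretrained ResNet50 (He et al., 2016) and visualize the embedding distributions with UMAP (McInnes et al., 2018). As the top of Fig. 4 shows, the Chinese images have a very different distribution from the English images (from NLVR2). In particular, many clusters of the English NLVR2 images are different breeds of dog, which is a consequence of the problems inherited from ImageNet. The bottom of Fig. 4 compares the image distributions of two MaRVL languages (Indonesian and Swahili). Image distributions still vary across languages within MaRVL, mostly because the concept sets differ. As the figure shows, the distinct clusters arise because the two regions have very different animal species. Because ResNet50 is pretrained on ImageNet, the clusters may be biased toward ImageNet concepts. As Fig. 4 (top) suggests, NLVR2 images generally cluster better than the Chinese images of MaRVL.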

Multilingual and multicultural statistics

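Fig. 2 compares the main multilingual and multicultural statistics of MaRVL concepts with those of ImageNet and NLVR2 concepts. MaRVL concepts are language-specific, yet they are found in more languages than the concepts of ImageNet and NLVR2. The authors conjecture that this is because MaRVL concepts are more prototypical and reflect more neighboring cultures. The middle and right plots of Fig. 2 support this by showing that more MaRVL concepts are found in more language families and macro-areas.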

Limitations.

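The authors chose the international annotation platforms that cover the most languages (proz.com and prolific.co), but recruiting speakers of low-resource languages remains difficult. Only 2-4 qualified annotators per language could be found for caption writing, which may amplify the biases of individual annotators. Some of the languages are not spoken natively by any of the authors. In addition, every concept is mapped to a Wikipedia page, and for low-resource languages the Wikipedia page for some concepts may be missing. Finally, only about five concepts are selected per semantic field, which can introduce bias when frequent concepts are distributed unevenly across categories. Overall, the MaRVL protocol still has room for improvement, but its goal of minimizing the biases of the dataset creators has been partly achieved.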

Baselines

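Several pretrained Transformer models have been proposed for vision-and-language tasks. They are inspired by the BERT architecture and redesigned to handle multimodal inputs. They are usually pretrained on large image-text corpora that are available only in English (Sharma et al., 2018). M3P (Ni et al., 2021) extends Unicoder-VL (Li et al., 2020a) to encode multilingual inputs, making it the first multilingual multimodal BERT-like architecture. Its pretraining alternates between modeling multimodal English data and text-only multilingual data. The paper follows this approach and proposes two multilingual variants: mUNITER, obtained by initializing UNITER with mBERT (Devlin et al., 2019), and xUNITER, obtained by initializing UNITER with XLM-R BASE (Conneau et al., 2020).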

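The UNITER architecture is a stack of Transformer layers similar to BERT-BASE, and its input is the concatenation of language and vision embeddings. The language input is first split into subword units (Wu et al., 2016; Sennrich et al., 2016) and surrounded by two special tokens: {[CLS], w1,…,wT , [SEP]}. Language embeddings are obtained as in the BERT architecture. The vision input is a sequence of visual features produced by a pretrained object detector, to which a special feature [IMG] encoding the whole image is added: {[IMG], v1,…, vK}. Each feature is embedded with a BERT-like embedding layer, using its bounding box coordinates as the input position. Finally, the global representation of an image-text pair is obtained through multiplicative pooling (Lu et al., 2019): the pooled representation of the text modality, extracted from the [CLS] token, and the pooled representation of the visual modality, extracted from the [IMG] feature, are multiplied element-wise to produce a single vector for the pair.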
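
The pooling step can be illustrated in a few lines. This is a minimal sketch that assumes the two pooled representations are already available as equal-length vectors; the function name and the toy numbers are hypothetical, and any projection layers that a real model applies before the product are left out.

```python
from typing import List, Sequence


def multiplicative_pooling(cls_vec: Sequence[float], img_vec: Sequence[float]) -> List[float]:
    """Element-wise product of the pooled text ([CLS]) and visual ([IMG]) vectors."""
    if len(cls_vec) != len(img_vec):
        raise ValueError("both pooled representations must have the same dimension")
    return [c * v for c, v in zip(cls_vec, img_vec)]


pooled_text = [0.5, -1.0, 2.0]   # toy [CLS] representation
pooled_image = [2.0, 3.0, 0.5]   # toy [IMG] representation
pair_vector = multiplicative_pooling(pooled_text, pooled_image)
print(pair_vector)
```

Unlike concatenation, the product keeps the dimension unchanged and makes each output coordinate depend on both modalities at once.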

The authors implement the models in VOLTA and pretrain them with the same data and hyperparameters as the controlled setup proposed by Bugliarello et al. (2021). This allows a fair comparison between the multilingual versions and their monolingual counterparts. The models are then fine-tuned on NLVR2 following the procedure first proposed by Lu et al. (2020). After fine-tuning on English, the multilingual models are tested on MaRVL in a 'zero-shot' cross-lingual transfer setting. The authors also benchmark the five monolingual vision-and-language BERT models available in VOLTA: UNITER, VL-BERT (Su et al., 2020), VisualBERT (Li et al., 2019a), ViLBERT (Lu et al., 2019), and LXMERT (Tan and Bansal, 2019). These models are pretrained in the same controlled setup and fine-tuned on the English training set of NLVR2. Following the 'translate test' approach to cross-lingual transfer (Banea et al., 2008; Conneau et al., 2018; Ponti et al., 2021b), they are evaluated on the MaRVL test sets automatically translated into English.

Results

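Tab. 4 shows the performance of the baseline models on MaRVL. Two metrics are reported: accuracy over all examples, and consistency, the proportion of unique sentences for which the predictions on all corresponding image pairs are correct. For a given transfer method, the differences among models are not statistically significant. This indicates that, when pretrained on the same amount of data, changing the network architecture has little effect on performance.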
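
The two metrics can be computed as in the sketch below. The tuple layout `(caption_id, gold, pred)` is an assumption made for illustration; the definitions follow the description above.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Tuple


def accuracy_and_consistency(
    examples: Iterable[Tuple[Hashable, bool, bool]]
) -> Tuple[float, float]:
    """examples: (caption_id, gold_label, predicted_label) per image pair.

    accuracy    = fraction of examples predicted correctly
    consistency = fraction of unique captions whose image pairs are ALL correct
    """
    total = 0
    correct = 0
    per_caption = defaultdict(list)
    for caption_id, gold, pred in examples:
        hit = gold == pred
        total += 1
        correct += int(hit)
        per_caption[caption_id].append(hit)
    accuracy = correct / total
    consistency = sum(all(hits) for hits in per_caption.values()) / len(per_caption)
    return accuracy, consistency


# Two captions, each paired with two image pairs.
examples = [
    ("c1", True, True),
    ("c1", False, False),
    ("c2", True, False),   # one mistake makes caption c2 inconsistent
    ("c2", False, False),
]
acc, cons = accuracy_and_consistency(examples)
print(acc, cons)
```

Consistency is the stricter metric: a single wrong image pair removes the whole caption from the count, which is why it is typically far below accuracy.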

Zero-shot vs. translate test.

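Multilingual and monolingual models perform similarly on English (NLVR2). On MaRVL, however, the zero-shot multilingual baselines drop sharply by 10-20 percentage points in the non-English languages, ending up only slightly above chance level. Surprisingly, this also holds for resource-rich languages such as Mandarin (ZH), for which unlabeled text is abundant. The translate-test baselines improve by 4-15% in the other languages, with Turkish improving the most. Even so, a substantial gap of more than 10% remains relative to English performance on NLVR2. A plausible explanation is that MaRVL data is out-of-distribution.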

Disentangling shifts in distribution.

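There are two main reasons why MaRVL is hard:

1) cross-lingual transfer and 2) out-of-distribution data, since the distribution of images and descriptions differs from that of the English dataset. To assess how each factor affects model performance, the authors run a controlled study on the Chinese portion, MaRVL-ZH. First, they translate MaRVL-ZH into English by hand to remove possible confounds from machine translation, and compare against the results in Tab. 4. As Tab. 5 (left column) shows, all models except mUNITER improve accuracy by only 1-2% relative to the translate-test evaluation, so the authors conclude that machine translation is fairly reliable. Out-of-distribution concepts, in turn, cause the most errors (a 10% drop in accuracy on average). Second, they sample 250 unique descriptions, corresponding to 1,000 data points, from the NLVR2 test set and translate them into Chinese by hand. This subset is called NLVR2 1k, and the performance of mUNITER and xUNITER on it is listed in Tab. 5 (right column). Although all data points are in-domain, the multilingual models mUNITER and xUNITER lose 16% accuracy compared with the English NLVR2 1k test set (center column). This gap can therefore be attributed to cross-lingual transfer from English to Chinese.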

Translate train.

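Finally, the authors establish a baseline for a third possible cross-lingual transfer method, 'translate train'. The NLVR2 training set is machine-translated into Chinese, and the resulting models are evaluated on MaRVL-ZH. The performance of mUNITER (62.5/18.7) and xUNITER (61.8/16.7) is almost identical to the 'translate test' setting, in which MaRVL-ZH is machine-translated into English. Once again, the lack of access to culturally relevant concepts appears to hinder generalization.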

Conclusions and Future Work

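The paper shows that the images and concepts in existing vision-and-language datasets are neither salient nor representative for many languages other than English and for cultures outside Europe and North America. To mitigate this bias, the authors develop a new annotation protocol in which the selection of images and captions is driven entirely by native speakers. They collect descriptions that compare and contrast image pairs in a diverse set of languages (Indonesian, Chinese, Swahili, Tamil, and Turkish) and release the result as MaRVL, a multicultural and multilingual dataset. They then develop and evaluate several multilingual and multimodal baselines and find that their performance is considerably lower than on the English dataset. This shows that MaRVL can assess more accurately how well models suit real-world applications outside English-speaking cultures. In future work, the authors plan to use MaRVL to evaluate models on other tasks such as object recognition, and to experiment with multilingual extensions of models based on contrastive learning.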

[TACL2021] Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

Wed, 20 Mar 2024 09:00:00 +0000

[pdf] [code and dataset]

Mor Geva^1,2, Daniel Khashabi², Elad Segal¹, Tushar Khot², Dan Roth³, Jonathan Berant^1,2
¹ Tel Aviv University ² Allen Institute for AI ³ University of Pennsylvania

Abstract

(Explicit Multihop reasoning) A major limitation of current multi-hop reasoning benchmarks is that their questions are explicit.
( StrategyQA ) This paper presents StrategyQA, a QA benchmark in which the reasoning steps are implicit in the question. The authors use term-based priming to get annotators to write creative questions, and apply adversarial filtering to build the benchmark.
(Statistics and Analysis) The benchmark contains 2,780 examples, each with a decomposition and evidence paragraphs. StrategyQA questions are short, topic-diverse, and cover a wide range of strategies. The authors report a human score of 87% and a baseline score of 66%.

1. Introduction

▶ Multi-hop Reasoning
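Multi-hop reasoning, in which solving an input requires several steps of inference, has recently become an active research area, and many models and benchmarks have been proposed. In existing benchmarks, however, the question (query) is usually explicit. As in the figure above, a question in an existing dataset looks like "Was Aristotle alive when the laptop was invented?", which directly mentions the information that has to be retrieved. Real-life questions are often more like "Did Aristotle use a laptop?": they are solved with the same steps, but the required information is implicit in the question.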

▶ Challenge in Implicit Question
Answering implicit questions is challenging. First, the overlap between the question and the context (evidence) is small, which makes the information hard to retrieve. In addition, the questions are usually short, so a model is less likely to find shortcuts such as lexical overlap.

▶ StrategyQA
The authors therefore present StrategyQA, a Boolean QA benchmark for implicit multi-hop reasoning that consists of strategy questions. Here, strategy refers to the ability to derive atomic sub-questions from a question (= decomposition ability).

Collecting strategy questions through crowdsourcing is not easy. Because the process demands creativity from annotators, it has to go beyond the usual practice of showing the entire context and asking for a multi-hop question. The authors design the dataset construction pipeline as follows: (a) annotators are given only minimal context, to encourage imagination and creativity, (b) as many annotators as possible are hired, for diversity, and (c) adversarial models are trained during data collection to prevent recurring patterns and to increase difficulty.

StrategyQA includes question decompositions and evidence paragraphs. As in the figure above, each example has a decomposition ("D") that splits the question into sub-questions, and the corresponding evidence ("E"). The benchmark analysis shows that StrategyQA spans diverse knowledge domains such as physics and geography, and that it is challenging for both retrieval and QA.

2. Strategy Questions

2.1. Desiderata

Different criteria can be desirable when creating a QA benchmark. Answerability and hallucination are both active research topics, for instance, and datasets can be designed around those needs. StrategyQA instead centers its desiderata on constructing implicit queries.

(1) Multi-Step : As in the first figure, the question breaks down into several sub-questions, and answering it requires logical operations over their answers.
(2) Feasible (Answerable) : The question must be answerable from paragraphs in the corpus.
(3) Implicit : This is the key property. The information needed should not be easy to extract directly from the natural language of the question.
(4) Definite : The question must have a clear answer. For example, "Do trees conduct electricity?" has a clear answer, because although some trees may conduct well (depending on the environment), the answer is generally "no". Questions whose answers are open to debate, such as "Can a hamburger be considered a sandwich?", are excluded.
(5) Boolean : Every answer is yes or no.

The paper's definition of implicitness:

a precise definition of implicit questions based on lexical overlap is elusive, but a good rule-of-thumb is the following:
If the question decomposition can be written with a vocabulary limited to words from the questions, their inflections, and function words, then it is an explicit question.

2.2. Decomposing Strategy Questions

Following the desiderata above, the authors annotate every question with a decomposition. In earlier work, the reasoning annotation mostly took the form of small text snippets called rationales or supporting facts, but the authors argue that the real reasoning does not appear explicitly in the context. They therefore attach a strategy question decomposition to every question-answer pair. Each question $q$ is decomposed into $n$ steps $\langle s_1, s_2, \dots, s_n \rangle$. Each step $s_i$ is a single-step question and may contain special references, which are placeholders that refer to the result of a previous step. The last step $s_n$ returns the final answer.
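
A decomposition of this kind can be executed step by step, as in the minimal sketch below. The `#k` placeholder syntax, the `run_decomposition` helper, and the lookup-table answerer are assumptions made for illustration; a real system would answer each step with retrieval and a QA model.

```python
import re
from typing import Callable, List


def run_decomposition(steps: List[str], answer_step: Callable[[str], object]) -> object:
    """Answer steps in order, replacing '#k' with the answer of step k."""
    answers: List[object] = []
    for step in steps:
        resolved = re.sub(
            r"#(\d+)",
            lambda m: str(answers[int(m.group(1)) - 1]),
            step,
        )
        answers.append(answer_step(resolved))
    return answers[-1]  # the last step returns the final answer


# Toy answerer backed by a lookup table (hypothetical values).
toy_knowledge = {
    "When did Aristotle live?": "384-322 BC",
    "When was the laptop invented?": "1980",
    "Is 1980 before 384-322 BC?": False,
}

steps = [
    "When did Aristotle live?",
    "When was the laptop invented?",
    "Is #2 before #1?",
]
final_answer = run_decomposition(steps, toy_knowledge.__getitem__)
print(final_answer)
```

The first two steps are retrieval-style questions, while the last one is a logical operation over their answers, which is why it can only be resolved after the placeholders are filled in.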

Table 3 above shows decomposition steps. For the explicit question in the first row ([QDMR]), the decomposition is restricted to a small vocabulary. The implicit decompositions in the other three rows have no vocabulary restriction: any token may appear, as long as the steps form the implicit reasoning. Each decomposition step is either a retrieval step or an operation step. As in the very first figure or the second row above, a retrieval step fetches information, and an operation step performs logical inference over the information already retrieved.

3. Data Collection Pipeline

※ See the paper.

4. The STRATEGYQA Dataset

To build the dataset, 29 question writers, 19 decomposers, and 54 evidence matchers were hired. 2,835 questions were collected, of which 55 were filtered out, leaving 2,780 questions.

4.1. Dataset Statistics

4.2. Data Quality

Do questions in STRATEGYQA require multistep implicit reasoning?
To measure quality, two experts (= authors) examined 100 random samples. Most of them (81%) were valid multi-step implicit questions.

Do questions in STRATEGYQA have a definitive answer?
Experts with web access analyzed the questions. They agreed with the annotated answer for 94% of the questions and disagreed for only 2%. The remaining 4% were ambiguous.

What is the quality of the decompositions?
The experts checked whether the decompositions were well formed and whether they were explicit or implicit. 83% of the decompositions were validly broken down into sub-questions, and 17% were explicit decompositions. For 14% out of that 17%, however, the original question was already explicit.

Would different annotators use the same decomposition strategy?
50 samples were drawn and given to different workers to decompose. In 44 of them (88%), the workers followed the same reasoning path. This suggests that different workers use the same strategy when decomposing a question.

Is the evidence for strategy questions in Wikipedia?
Three workers judged whether each decomposed question could be matched to evidence. 88.3% of the questions were fully covered, and for 86.9% of the questions at least one worker matched the evidence.

4.3. Data Diversity

(1) Reasoning Skills

Strategy Diversity

Domain-related and logical reasoning skill diversity

(2) Question Topics

4.4. Human Performance

Experts (= authors) answered 100 sampled questions. Their accuracy was about 87%, and the error analysis shows that the main reason for failure was difficulty in finding the evidence.

5. Experimental Evaluation

The benchmark is analyzed from three angles.

a) Can LMs solve StrategyQA well?
b) Is retrieving relevant context helpful?
c) Does decomposition help?

5.1. Baseline Models

Backbone model : ROBERTA fine-tuned on BOOLQ, MNLI, TWENTY QUESTIONS, and DROP
Setting : No context / With context (BM25 retrieval)
Predicting Decompositions : BART is trained to decompose questions
Baseline models : ROBERTA with no retrieval / ROBERTA with retrieval using gold decompositions / ROBERTA with gold decompositions and gold paragraphs

5.2. Results

Strategy QA performance

Maximum score : ACC 72.0

Retrieval Evaluation

Maximum score : Recall@10 0.282
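
For reference, a minimal sketch of how Recall@k is commonly computed for evidence retrieval. The definition used here (fraction of gold evidence paragraphs that appear among the top k retrieved ones) is an assumption and may differ in detail from the paper's exact protocol; the paragraph ids are hypothetical.

```python
from typing import Collection, Hashable, Sequence


def recall_at_k(retrieved: Sequence[Hashable], gold: Collection[Hashable], k: int) -> float:
    """Fraction of gold paragraphs that appear in the top-k retrieved list."""
    if not gold:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(gold)) / len(gold)


ranked = ["p3", "p7", "p1", "p9"]   # retriever output, best first
gold_paragraphs = {"p1", "p2"}      # annotated evidence
r3 = recall_at_k(ranked, gold_paragraphs, k=3)
print(r3)
```

A Recall@10 of 0.282 would then mean that, on average, fewer than a third of the gold evidence paragraphs are found in the top 10 results.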

Conclusion

We present STRATEGYQA, the first dataset of implicit multi-step questions requiring a wide range of reasoning skills. To build STRATEGYQA, we introduced a novel annotation pipeline for eliciting creative questions that use simple language, but cover a challenging range of diverse strategies. Questions in STRATEGYQA are annotated with decomposition into reasoning steps and evidence paragraphs, to guide the ongoing research towards addressing implicit multi-hop reasoning.

[ACL2023] ReAugKD: Retrieval-Augmented Knowledge Distillation For Pre-trained Language Models

Mon, 18 Mar 2024 08:00:00 +0000

[pdf] [github]

Jianyi Zhang¹, Aashiq Muhamed², Aditya Anantharaman², Guoyin Wang², Changyou Chen³, Kai Zhong², Qingjun Cui², Yi Xu², Belinda Zeng², Trishul Chilimbi², Yiran Chen¹
¹ Duke University, ² Amazon ³ University at Buffalo, SUNY

Abstract

(Knowledge Distillation) Distilling large-scale pre-trained LMs into smaller models is an active research area. Existing KD approaches train the student model by transferring the teacher model's soft labels and intermediate activations.
( ReAugKD ) This paper proposes a distillation method that achieves better generalization by using a non-parametric memory in the form of a knowledge base alongside the teacher's soft labels. The ReAugKD framework, which lets the student model retrieve from the knowledge base effectively, is trained with a loss that aligns relational knowledge in the teacher and student embedding spaces.
(Experiment) Experiments show state-of-the-art performance on the GLUE benchmark.

1. Introduction

▶ Knowledge Distillation (KD)
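LMs such as BERT, RoBERTa, and Electra perform well, but they have millions to billions of parameters and are hard to run in constrained environments. This has motivated work on knowledge distillation (KD), which treats such strong models as teachers and transfers their knowledge to student models with fewer parameters. Existing KD methods typically train the student by minimizing the divergence between its output predictions and those of the teacher, so that the knowledge is stored in the student's parameters. This simple form of KD is limited by the small number of student parameters. In particular, it is hard to distill the task-specific knowledge that large models hold.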

▶ Retrieval-Augmented Knowledge Distillation (ReAugKD)
To address this problem, the authors propose Retrieval-Augmented Knowledge Distillation (ReAugKD). In addition to the implicit parametric memory, ReAugKD brings in a non-parametric external memory and retrieves from it with kNN retrieval. The key intuition is to give the student model the ability to use an external memory derived from the teacher's task-specific knowledge.

▶ Experiment
Experiments show that the method achieves state-of-the-art results on the GLUE benchmark with only 3% latency overhead compared with the setting without retrieval. The authors also confirm that training with ReAugKD improves the generalization of the student model.

2. Methodology

ReAugKD has two main phases: a training phase and an inference phase.

2.1. Training Phase

The training phase has two steps.

In the first step, a linear projection head $L$ is attached to the teacher model, which has been fine-tuned on a specific downstream task. The input dimension of this projection is the teacher embedding dimension, and its output dimension is the student embedding dimension. All other teacher parameters are frozen, and only the parameters of head $L$ are trained, using a supervised contrastive loss.

In the second step, knowledge distillation is carried out using the teacher embeddings produced by head $L$ together with the teacher soft labels.

2.2. Loss Function

Notations

$N$ : batch size
$x_i$ : student embedding
$y_i$ : student prediction
$\hat{y_i}$ : teacher soft label
$z_i$ : teacher embedding (output of the projection head $L$)

The similarity distribution $q_{i,j}$ between $z_i$ and an anchor $z_j$ is given below.

$q_{i,j}$ can be interpreted as capturing relational knowledge, in terms of cosine distance, with respect to the other embeddings in the batch.

$\hat{q_{i,j}}$ below is the corresponding similarity distribution between teacher embeddings and student embeddings.

The loss is trained to reduce the divergence between the two distributions $q_{i,j}$ and $\hat{q_{i,j}}$. A cross-entropy loss for distillation is added on top, giving the final loss below.
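
Since the equations are not reproduced in this summary, the sketch below only illustrates the idea in plain Python. Several details are assumptions: the temperature `tau`, the softmax normalization over all anchors in the batch (including the embedding itself), the KL direction (teacher to student), and the weight `alpha` that combines the two terms.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def softmax(values: Sequence[float]) -> List[float]:
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_distribution(queries: Sequence[Vector], anchors: Sequence[Vector], tau: float):
    """Row i: softmax over anchors j of cosine(query_i, anchor_j) / tau."""
    return [softmax([cosine(q, a) / tau for a in anchors]) for q in queries]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def relational_loss(student_emb, teacher_emb, tau: float = 0.1) -> float:
    """Mean KL between teacher-teacher and student-teacher similarity rows."""
    q_teacher = similarity_distribution(teacher_emb, teacher_emb, tau)
    q_student = similarity_distribution(student_emb, teacher_emb, tau)
    return sum(kl_divergence(t, s) for t, s in zip(q_teacher, q_student)) / len(teacher_emb)


def soft_cross_entropy(student_pred: Sequence[float], teacher_soft: Sequence[float]) -> float:
    return -sum(t * math.log(s) for s, t in zip(student_pred, teacher_soft) if t > 0.0)


def total_loss(student_emb, teacher_emb, student_preds, teacher_soft, alpha=1.0, tau=0.1):
    ce = sum(soft_cross_entropy(s, t) for s, t in zip(student_preds, teacher_soft)) / len(student_preds)
    return ce + alpha * relational_loss(student_emb, teacher_emb, tau)


teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned = relational_loss(teacher, teacher)               # student matches teacher
swapped = relational_loss([[0.0, 1.0], [1.0, 0.0]], teacher)
print(aligned, swapped)
```

The relational term vanishes when the student embeddings relate to the teacher anchors exactly as the teacher embeddings do, and grows when the student places examples near the wrong anchors.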

2.3. Inference Phase

The knowledge base (KB) is made up of the teacher embeddings and predictions. K-nearest-neighbor (KNN) retrieval is then performed with the HNSW algorithm. That is, given the student's embedding and prediction ($x_i$, $y_i$), the most similar teacher embeddings and predictions ($z_i$, $\hat{y_i}$) are retrieved from the KB with a KNN classifier. The K retrieved results are averaged to obtain the weighted average of soft labels below,

and the hyperparameter $\beta$ is used to mix the two predictions.
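
The inference step can be sketched as follows. This is not the authors' implementation: exact brute-force search replaces HNSW for simplicity, dot-product similarity and softmax weighting of the neighbors are assumptions, and so is the choice to put $\beta$ on the student prediction and $1-\beta$ on the retrieved soft label.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(values: Sequence[float]) -> List[float]:
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def knn_soft_label(query: Vector, kb_embeddings, kb_soft_labels, k: int) -> List[float]:
    """Similarity-weighted average of the soft labels of the k nearest KB entries."""
    sims = [sum(a * b for a, b in zip(query, z)) for z in kb_embeddings]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    weights = softmax([sims[i] for i in top])
    n_classes = len(kb_soft_labels[0])
    return [
        sum(w * kb_soft_labels[i][c] for w, i in zip(weights, top))
        for c in range(n_classes)
    ]


def mix_predictions(student_pred: Vector, retrieved: Vector, beta: float) -> List[float]:
    """Assumed form: beta * student prediction + (1 - beta) * retrieved label."""
    return [beta * s + (1.0 - beta) * r for s, r in zip(student_pred, retrieved)]


# Toy knowledge base of teacher embeddings z and teacher soft labels.
kb_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
kb_soft_labels = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]

student_embedding = [1.0, 0.0]
student_prediction = [0.6, 0.4]

retrieved = knn_soft_label(student_embedding, kb_embeddings, kb_soft_labels, k=2)
final = mix_predictions(student_prediction, retrieved, beta=0.5)
print(retrieved, final)
```

Because both inputs to the mix are probability distributions, the result is one as well, so no renormalization is needed.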

3. Experimental Results

Experiment setting

Backbone model : BERT-base -> 6-layer BERT (768 dim)
Benchmark : GLUE
Baseline method : vanilla KD, TAKD, RCO, RKD, DML, PKD, ProKT, SFTN, MetaDistil

Experimental Results on GLUE

ReAugKD sets a new SOTA, 0.34% ahead of the previous SOTA, MetaDistil.
MetaDistil is better on MRPC, but ReAugKD is more efficient because it does not require meta-learning.
Adding retrieval at inference time improves performance by about 0.37%, and performance is at SOTA level even without retrieval.

Number of Neighbors Retrieved (k)

Retrieval adds about 3% time overhead over the original inference time, even though it runs on CPU only.

Conclusion

In this paper, we present ReAugKD, a knowledge distillation framework with a retrieval mechanism that shows state-of-the-art performance on the GLUE benchmark. In the future, we plan to expand the knowledge base with more information from the teacher and extend it to additional tasks.

Limitations

Our method relies on having access to teacher embeddings and prediction which may not always be possible in a black-box distillation setting. Retrieval augmentation also requires maintaining a knowledge base that is memory intensive.
The cost of the retrieval process is dependent on the size of the training corpus, which can be a limitation when dealing with very large training datasets.
Conducting dataset distillation (Wang et al., 2018b) on the training corpus to further reduce memory cost and retrieval time is an important future step for our framework.

[Arxiv 2305] Trusting Your Evidence: Hallucinate Less with Context-aware Decoding

Fri, 15 Mar 2024 07:00:00 +0000

[pdf] [github]

Weijia Shi^1*, Xiaochuang Han^1*, Mike Lewis², Yulia Tsvetkov¹, Luke Zettlemoyer¹, Scott Yih²
¹ University of Washington, Seattle, WA, ² Meta AI

Abstract

(Context-Aware Decoding) The authors propose context-aware decoding (CAD), which follows a contrastive output distribution that amplifies the difference between the LM's output probabilities with and without the context.
(Improving Faithfulness) CAD improves faithfulness on summarization tasks for OPT, GPT, LLaMA, and FLAN-T5.
(Resolving Knowledge Conflict) In addition, the authors argue that CAD is effective at resolving the conflict when the provided context contradicts the model's prior knowledge.

1. Introduction

▶ How LMs deal with prior knowledge and context knowledge
It is well known that language models (LMs) produce coherent and fluent generations. However, more research is still needed on how LMs handle two kinds of knowledge source during generation: the prior knowledge stored in their parameters and the context knowledge supplied from outside.

Many early studies focused on prior knowledge, in particular on the hallucinations that arise when no context knowledge is used. More recently, retrieval-augmented approaches, which supply external knowledge as context for the LM to use during generation, have become a common way to reduce such hallucinations. How the LM behaves when the two sources conflict, however, remains a problem. For example, when LLaMA is given the external knowledge “Argentina won the World Cup in 1978, 1986, and 2022” and is then asked “How many World Cups has Argentina won?”, it answers “Two”, following its prior knowledge (from before 2022).

▶ CAD : Context-Aware Decoding

This work proposes a simple context-aware decoding method. As in the figure above, CAD samples from a new output distribution that amplifies the difference between the output distributions with and without context. It is a new form of contrastive decoding. Prior work ([1]) describes a contrastive decoding scheme that down-weights prior knowledge when more relevant contextual information is provided. CAD is an off-the-shelf method that can be used without any additional training.

▶ Experiments
In the experiments, faithfulness improves not only for vanilla LMs such as OPT, GPT-Neo, and LLaMA, but also for instruction-finetuned LMs such as FLAN. On CNN-DM, applying CAD to LLaMA-30B improves ROUGE-L by 21% and the summary factuality evaluation metric by 14.3%. CAD is especially beneficial on knowledge-conflict tasks: on a knowledge conflicts QA dataset ([2]) it gives LLaMA-30B a 2.9x improvement over the existing method, and the gain grows as the model size increases.

2. Method

2.1. Background

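The response is given by the expression below.

Here, the context $c$ may contain content that conflicts with, or is unfamiliar to, the prior knowledge. For example, as in the first figure, the external knowledge “Argentina won in 1978, 1986, and 2022” can be given as a context $c$ that conflicts with the prior knowledge.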

2.2. Context-aware decoding

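To address this, the authors extract the output the model would produce from prior knowledge alone, without the context. The prior knowledge is therefore expressed by the formula below,

and the model's output distribution is adjusted using the pointwise mutual information (PMI) between the context $c$ and the generation $y_t$.

On its own this expression is not a valid output distribution, so it must be normalized. Applying a softmax gives the final context-aware decoding distribution.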
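As a rough sketch of the adjustment described above: multiplying the with-context probability by the PMI term raised to a power $\alpha$ and renormalizing is equivalent to a softmax over $(1+\alpha)$ times the with-context logits minus $\alpha$ times the without-context logits. The function and argument names below are illustrative, and the two logit vectors are assumed to come from two forward passes of the same LM (one with the context prepended, one without).

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def context_aware_distribution(logits_with_context, logits_without_context, alpha=0.5):
    """Next-token distribution that up-weights what the context adds.

    alpha = 0 recovers ordinary decoding with the context;
    larger alpha pushes harder away from the context-free prior.
    """
    adjusted = ((1.0 + alpha) * log_softmax(logits_with_context)
                - alpha * log_softmax(logits_without_context))
    return np.exp(log_softmax(adjusted))  # renormalize into a valid distribution
```

Tokens that become more likely once the context is present gain probability mass, while tokens favored only by the context-free prior lose it.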

3. Experimental Setup

3.1. Datasets and Metrics

Summarization : Dataset CNN-DM, XSUM Metric ROUGE-L, BERT-Precision, FactKB
Knowledge Conflicts : Dataset MemoTrap, NQ-Swap

The table below shows an example from each of the two context-dependent tasks.

3.2. Models and Baselines

The models are OPT-13B, OPT-30B, GPT-Neo 2.7B, GPT-Neo 20B, LLaMA-13B, and LLaMA-30B, plus FLAN-T5-3B and FLAN-T5-11B as instruction-finetuned models. $\alpha$ is set to 0.5 for summarization and 1.0 for the knowledge-conflict tasks. As baselines, the knowledge-conflict tasks are compared against greedy decoding and the summarization tasks against top-p sampling (p=0.9).

4. Results

4.1. Main Results

(1) Summarization

We observe that CAD outperforms the standard decoding algorithm by a large margin in all eight models across both datasets
Specifically, when applied to LLAMA30B in CNN-DM, CAD leads to 21% increase in ROUGE-L, 14.3% increase in factKB and 7.8% increase in BERT-P.
These results show that CAD is effective in terms of factuality as well as quality.

(2) Knowledge Conflicts

CAD is significantly better than the regular decoding in all settings, with the exception of a minor decrease observed for FLAN-T5 on the non-conflict NQ dataset
Despite this, CAD achieves substantially better performance on the knowledge conflict datasets, e.g., CAD improve GPT-Neo 20B by 54.4% on Memotrap and by 128% on NQ-SWAP
This shows that CAD is effective in scenarios where the context conflicts with the LM's prior knowledge.

4.2. Analysis

(1) Qualitative Analysis

On XSUM, regular decoding generates statements that are not in the article, whereas CAD generates a summary that relies only on the article.
On MemoTrap, standard decoding ignores the instruction, whereas CAD follows it.

(2) CAD brings consistent improvement to LMs with different sizes.

CAD is effective across all the model sizes tested.
On MemoTrap and NQ-SWAP, the gain from CAD grows as the model size increases.

(3) Effect of adjustment level

$\alpha$=0.5 works best.

Conclusion

Off-the-shelf language models may suffer from an insufficient attention to the supplied context compared to its learned prior knowledge, leading to an unfaithful generation to the input context. We present context-aware decoding, a simple inference-time method that downweights an output probability associated with the model’s prior knowledge to promote models’ attention to the contextual information. We experiment on two families of tasks that require a strong attention to the context, summarization and knowledge conflicts tasks. We show that CAD provides more reliable and factual outputs across different language models of various sizes.

[Arxiv 2401] Blinded by Generated Contexts: How Language Models Merge Generated and Retrieved Contexts for Open-Domain QA?

Wed, 13 Mar 2024 13:08:00 +0000

[pdf]

Hexiang Tan^♠♡, Fei Sun^♠†, Wanli Yang^♠♢, Yuanzhuo Wang^♠, Qi Cao^♠, Xueqi Cheng^♠♡
^♠ CAS Key Laboratory of AI Safety & Security, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China ^♡ University of Chinese Academy of Sciences, Beijing, China ^♢ Nankai University, Tianjin, China

Abstract

(Merging generated context with retrieved context) For retrieval-augmented generation, there are growing attempts to give the LLM additional information by merging in context generated by the LLM itself or by another LLM, but how LLMs handle this merge is understudied.
(Conflicting dataset) The authors build datasets in which the golden answer is contained in only one of the generated and retrieved contexts, and use them to trace the origin of the response.
(Experiment) The experiments reveal a significant bias towards generated contexts in GPT-4/3.5 and LLaMA2. They also find that LLM-generated contexts have much higher relevance to the query.
(Takeaway) The work clarifies how LLMs merge diverse contexts and can contribute to progress on current RALMs.

1. Introduction

▶ Using Auxiliary info in LLMs
Many recent studies improve LLM performance on knowledge-intensive tasks by supplying auxiliary information ([1]). Several recent works adopt a generation-augmented approach ([2],[3]), which uses LLM-generated context in place of the retrieval-augmented approach. GENREAD is a representative example.

▶ Hybrid Approach
Recent studies ([4],[5]) propose hybrid approaches that feed in retrieved and generated context information together. However, the hybrid approach faces a significant challenge: conflicts between diverse sources impede the effectiveness of information integration ([7]). This work investigates how LLMs resolve conflicts between generated and retrieved contexts.

▶ How LLMs handle conflict between retrieved info and generated info

The authors show that, in certain cases, the hybrid approach fails as in the figure above. To investigate why, they present a systematic framework that breaks down and analyzes how LLMs merge contexts. They deliberately construct conflicting datasets in which only one of the generated and retrieved contexts contains the answer, and then examine which context the LLM chooses.

Across several experiments, they find a significant bias towards generated contexts in SOTA LLMs such as GPT-4/3.5 and LLaMA2. Moreover, the result holds regardless of whether the generated context was produced by the LLM itself or by another LLM. This exposes a critical challenge in how LLMs merge parametric knowledge and external information when the two conflict. The analysis also shows that text similarity, not confirmation bias, is the key factor in which context the LLM selects.

2. Background & Study Formulation

2.1 Background

The retrieval-augmented, generation-augmented, and hybrid approaches are illustrated below.

2.2 Answer Tracing Task

The authors propose an answer tracing task, which examines whether an answer originates from the generated context or from the retrieved context. The task is solved with LLMs in a zero-shot setting.

3. Experimental Setup

3.1 Context-Conflicting Datasets

For the experiments, the authors build context-conflicting datasets in which exactly one of the retrieved and generated contexts contains the answer. The construction criteria are Traceability (the ANSWER must be supported by at least one context) and Exclusivity (the ANSWER must be supported by only one of the two contexts).

The golden answers of NaturalQuestions (NQ) and TriviaQA are used to build the data.

Step 1. Context Preparation
For the retriever, the top-1 ranked passage from Contriever is used. Contriever is one of the strong off-the-shelf retrievers and is also used in recent RALMs.

For the generator, an LLM is used, following the GENREAD framework. The temperature is set to 0 for reproducibility. The retriever typically returns contexts of about 100 words, whereas the generator tends to produce contexts longer than 250 words. Because this length discrepancy could itself affect the results, a length constraint is imposed so that the discrepancy is about 3%.

Step 2. Sample Filtering
Samples are filtered to ensure traceability: only samples whose ANSWER is supported by at least one of the retrieved context and the generated context are kept. Samples whose ANSWER is supported by neither context, and therefore relies on intrinsic parametric knowledge, are discarded.

Step 3. Building Dataset
To ensure exclusivity, only the cases where the ANSWER depends on exactly one context are kept. Cases that depend on the retrieved context are named AIR, and cases that depend on the generated context are named AIG.
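The two filters in Steps 2 and 3 can be sketched as below. This is a simplified reading of the criteria as summarized in this post: "supported by a context" is approximated by a case-insensitive substring match against the golden answers, which is an assumption and may differ from the check used in the paper. The helper names are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace for a lenient string comparison."""
    return " ".join(text.lower().split())

def supports(context, gold_answers):
    """Assumed proxy for 'the context supports the answer': substring match."""
    ctx = normalize(context)
    return any(normalize(ans) in ctx for ans in gold_answers)

def label_sample(gold_answers, retrieved_context, generated_context):
    """Return 'AIR', 'AIG', or None if the sample is filtered out."""
    in_retrieved = supports(retrieved_context, gold_answers)
    in_generated = supports(generated_context, gold_answers)
    if not (in_retrieved or in_generated):
        return None  # fails traceability: neither context supports the answer
    if in_retrieved and in_generated:
        return None  # fails exclusivity: both contexts support the answer
    return "AIR" if in_retrieved else "AIG"
```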

3.2 Statistics of Datasets

The table above shows the statistics for each LLM used as generator and reader. Only a small portion of NQ and TriviaQA, around 10%, qualifies. GPT-4 yields fewer conflicting instances, which the authors attribute to its stronger ability to make use of the retrieved or generated context.

3.3 Evaluation Metric

The authors propose DiffGR, a metric on a [-1, 1] scale. On AIR cases, where the answer comes from the retrieved context, an ideal LLM would have a DiffGR of -1.
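The exact DiffGR formula is not reproduced in this post. One form that is consistent with the stated [-1, 1] range and sign convention is a normalized difference between how often the model's answer matches the generated context and how often it matches the retrieved context. The sketch below assumes that form and should be checked against the paper.

```python
def diff_gr(n_from_generated, n_from_retrieved):
    """Assumed form: (G - R) / (G + R).

    n_from_generated: number of answers traced to the generated context.
    n_from_retrieved: number of answers traced to the retrieved context.
    +1 means the model always follows the generated context, -1 the retrieved one.
    """
    total = n_from_generated + n_from_retrieved
    if total == 0:
        raise ValueError("no traceable answers")
    return (n_from_generated - n_from_retrieved) / total
```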

4. How LLMs Merge Contexts?

4.1 LLMs Prefer Self-Generated Contexts

EM Results

The LLMs perform very poorly on the AIR datasets and very well on the AIG datasets, which shows that they rely heavily on the generated context.

DiffGR Results

An ideal LLM would score 1 on AIG and -1 on AIR, but in the graph above AIR is also positive, which shows that the LLMs do poorly on AIR and rely heavily on the generated context.

4.2 LLMs Broadly Prefer Generated Contexts

The results in 4.1 confirm that LLMs tend to prefer contexts they generated themselves. Do they also rely on contexts generated by other LLMs?

(i) LLMs are also biased towards contexts generated by other LLMs.
(ii) LLMs usually exhibit a stronger bias to contexts generated by themselves.

5. Why LLMs Prefer Generated Contexts

This section analyzes why LLMs prefer generated contexts over retrieved contexts from three perspectives: confirmation bias, text similarity, and context completeness.

5.1 Effect of Confirmation Bias

One study ([9]) found that LLMs prefer contexts that are consistent with their parametric knowledge. When a single LLM serves as both generator and reader, the authors treat the generated context as parametric knowledge and analyze whether confirmation bias drives the preference for generated contexts.

To prevent the generated context from aligning with the LLM's parametric knowledge, the authors construct counter-memory contexts, in which the answer differs from the one in the original generated context. DiffGR is then measured again using these counter-memory contexts.

In the table above, the LLMs still tend to choose the generated context even in the counter-memory setting, where it is inconsistent with their parametric knowledge. Confirmation bias is therefore not the key factor. In particular, GPT-3.5 reaches a DiffGR of 0.8010 on TQA-AIR even with counter-memory contexts, showing that it follows the generated context blindly.

5.2 Effect of Text Similarity

Second, the authors analyze whether the text similarity between the context and the question has an effect. BERTScore and Jaccard similarity are used as the text similarity metrics, covering semantic and lexical similarity respectively.

In the results above, the retrieved contexts have consistently lower similarity, which points to a strong link with text similarity. In addition, the authors define the similarity gap below,

and the results of that experiment are as follows.

The two results lead to the following conclusion: “LLMs exhibit an increased bias to generated contexts on slices with a larger average similarity gap”
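For the lexical side of this analysis, Jaccard similarity is simple enough to sketch directly. The whitespace tokenization is a simplification, and the `similarity_gap` function is an assumed form (generated-context similarity minus retrieved-context similarity), since the paper's definition is not reproduced in this post.

```python
def jaccard(text_a, text_b):
    """Jaccard similarity between the word sets of two texts."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def similarity_gap(question, generated_context, retrieved_context):
    """Assumed form: how much closer the generated context is to the question."""
    return jaccard(generated_context, question) - jaccard(retrieved_context, question)
```

A positive gap means the generated context shares more vocabulary with the question than the retrieved context does.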

5.3 Effect of Context Completeness

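Because retrieval usually returns passages cut by fixed-length truncation, the authors analyze whether the completeness of the context affects the LLM's choice. As in the table below, three settings are compared: Nature (no truncation), Truncation at the token level (a sentence may be cut off), and S-Truncation at the sentence level (no sentence is cut off).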
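The difference between the two truncation settings can be illustrated with a small sketch. Word counts stand in for tokens and a regex stands in for a real sentence splitter; both are simplifications chosen for brevity, not the paper's preprocessing.

```python
import re

def truncate_tokens(text, max_words):
    """Token-level truncation: cut at the word budget, even mid-sentence."""
    return " ".join(text.split()[:max_words])

def truncate_sentences(text, max_words):
    """Sentence-level truncation: keep whole sentences that fit the budget."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept, used = [], 0
    for sentence in sentences:
        length = len(sentence.split())
        if used + length > max_words:
            break  # adding this sentence would exceed the budget
        kept.append(sentence)
        used += length
    return " ".join(kept)
```

With a budget of 8 words, the first function may end in the middle of the second sentence, while the second stops cleanly after the first sentence.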

The results are in the table above. Truncation and S-Truncation give similar results, so sentence completeness in itself is not a key factor. However, when the length constraint on the generated context mentioned earlier is removed, so that the generated context is much longer than the retrieved one (Nature vs. S-Trunc), there is a large difference. This leads to the following conclusion: “LLMs tend to favor contexts with enhanced semantic completeness”

Conclusion & Future Work

In this study, we propose a framework to investigate the underlying mechanisms by which LLMs merge retrieved and generated contexts. Our results reveal a pronounced bias towards generated contexts in several LLMs (GPT 3.5/4 and Llama2-7b/13b). We further identify two key factors that may contribute to this bias: higher similarity between generated contexts and questions, and the semantic incompleteness of retrieved contexts.
Our insights highlight the critical need for advanced integration methods that can validate and leverage information from both sources, moving beyond the current overreliance on generated contexts. Additionally, we find that LLMs display significant sensitivity to the semantic completeness of input contexts. This sensitivity necessitates improved passage segmentation strategies in current retrieval-augmented systems, thereby ensuring the preservation of intended meaning and the maximization of utility. Finally, addressing the challenges posed by highly relevant yet incorrect information generated by LLMs is an important direction for future research. It is crucial to develop methods for detecting and discounting misleading information produced by LLMs, especially as the volume of such content continues to escalate.

Limitation

Our work has the following limitations:
• This study is confined to open-domain question answering, a representative knowledge-intensive task. The behavior of LLMs across a broader spectrum of natural language processing tasks remains to be further explored.
• This work does not propose specific solutions to effectively mitigate the observed bias, as we focus on revealing the phenomena and analyzing the causes.
• To create a controlled environment conducive to analysis, we utilize a single instance for each context type. LLMs face increasingly intricate conflict scenarios when handling multiple contexts from each type. These conflicts emerge not only between retrieved and internally generated contexts but also among the various contexts originating from the same source (Chen et al., 2022; Xie et al., 2023).

[EMNLP2023] IfQA: A Dataset for Open-domain Question Answering under Counterfactual Presuppositions

Mon, 11 Mar 2024 09:00:00 +0000

[pdf] [github]

Wenhao Yu^♦, Meng Jiang^♣, Peter Clark^♠, Ashish Sabharwal^♠
^♦ Tencent AI Seattle Lab ^♣ University of Notre Dame ^♠ Allen Institute for AI

Abstract

(Lack of counterfactual QA dataset) Counterfactual reasoning is very important, but the lack of a large-scale counterfactual open-domain question answering (QA) dataset makes it hard to evaluate models.
(IfQA) The authors introduce the IfQA benchmark, in which every question is based on a counterfactual presupposition expressed with an “if” clause. To answer these questions, a model must identify the right information even for an imagined situation that contradicts the facts stored in its parameters.
(Experiment) Supervised retrieve-then-read pipeline models score low on IfQA, and it remains a challenging open-domain QA benchmark even with Chain-of-Thought prompting on ChatGPT.

1. Introduction

▶ Counterfactual reasoning
Counterfactual reasoning refers to the human tendency to imagine possible alternatives to events that actually happened or to what is factually true. For example, corporate leaders weigh the potential ripple effects of an alternative investment strategy when making decisions, and this kind of supposition is counterfactual reasoning. It is very important for AI models to be able to make such contrary-to-fact suppositions, yet no current open-domain QA task addresses them. Most open-domain QA work focuses only on questions whose answers can be found in global resources such as the internet.

However, a counterfactual presupposition can be viewed as a causal intervention, because the answer under the given presupposition must follow the background knowledge shared among human readers. Models need to be able to retrieve accurate information and reason over it even in such imagined situations.

▶ IfQA
Some studies have tried to identify and correct counterfactual evidence when it is given, but there has been no attempt to develop and evaluate counterfactual reasoning capability in the open-domain QA scenario. The authors therefore build and propose IfQA, a counterfactual presupposition benchmark dataset of 3,800 questions.

The figure above shows examples. IfQA combines causal inference questions with factual text sources.

IfQA poses new challenges in both retrieval and reading. For example, in the second example in the figure above, the search-and-reasoning process takes four steps: (i) [search] the current height of Mount Everest (8,848 m), (ii) [calculate] 8848 - 300 = 8548, (iii) [retrieve] the current height of the second-highest mountain, K2 (8,611 m), and (iv) [compare] the two heights and generate the taller mountain: K2.
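The four-step chain above reduces to one subtraction and one comparison once the two facts have been retrieved. The sketch below hard-codes the heights quoted in the example; the function name is illustrative only.

```python
# Heights in metres, as quoted in the example above.
EVEREST_M = 8848
K2_M = 8611

def tallest_after_lowering(everest_m, k2_m, lowered_by_m):
    """Apply the counterfactual (Everest is lower) and compare with K2."""
    hypothetical_everest = everest_m - lowered_by_m      # step (ii): calculate
    if hypothetical_everest > k2_m:                      # step (iv): compare
        return hypothetical_everest, "Mount Everest"
    return hypothetical_everest, "K2"
```

The hard part for a QA system is not this arithmetic but realizing that the height of K2 must be retrieved at all, since K2 is never mentioned in the question.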

▶ Experiment
To establish an initial performance level on IfQA, that is, to provide baseline results, the authors evaluate state-of-the-art closed-book and open-book models. The closed-book models use ChatGPT with CoT, and the open-book models are retrieve-then-generate models such as RAG and FiD.

The experiments show that IfQA is challenging for both retrieval and reading. The notable findings are: (1) in retrieval, traditional dense retrieval methods based on semantic matching could not capture the discrepancy between the counterfactual presupposition and the actual factual evidence well; (2) state-of-the-art reader models such as FiD struggled, reaching an F1 score of only about 50% even when the gold passage was given; (3) closed-book CoT reasoning improved end-QA performance but still lagged far behind open-book models; and (4) combining passage retrieval with a large-model reasoner performed best.

2. IfQA : Task and Dataset

2.1. Dataset Collection

All data was collected through Amazon Mechanical Turk (AMT). ※ See the paper for the crowdsourcing details.

The annotation protocol has three stages: (i) passages that are likely to lend themselves to counterfactual questions are extracted from Wikipedia; (ii) crowdworkers then write counterfactual questions; and (iii) additional workers assess correctness and quality. The task form used for annotation is shown below.

(1) Question and Answer Annotation

Passage Selection: First, only Wikipedia passages related to causal events are kept. Specifically, passages are filtered with causality keywords such as “lead to, cause, because, due to, originally, initially”. The authors report that, compared with randomly selected passages, this filtering substantially lowers the difficulty of question annotation.
Question Annotation: To give workers flexibility in the question annotation process of each Human Intelligence Task (HIT), workers can choose 10 out of 20 Wikipedia passages to annotate with questions. (Question quality was reportedly better when workers were shown a few examples, and better still when the examples were varied.) Without diverse question examples, workers tended to simply mimic the existing questions, so the authors prepared a diverse pool of examples and showed about 5 of them. Workers were also encouraged to write questions in free form rather than being limited to templates, and about 20.6% of the questions were written in free form.
Answer Annotation: Finally, workers write a valid answer to each question they created, where appropriate.
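The passage selection step can be sketched as a plain keyword filter. The keyword list is the one quoted above; the case-insensitive substring match is an assumption about how the filter is applied, and the function name is hypothetical.

```python
# Causality keywords quoted in the passage selection step.
CAUSAL_KEYWORDS = ("lead to", "cause", "because", "due to", "originally", "initially")

def is_causal_passage(passage):
    """Keep a passage if it mentions any causality keyword (case-insensitive)."""
    lowered = passage.lower()
    return any(keyword in lowered for keyword in CAUSAL_KEYWORDS)

def select_passages(passages):
    """Filter a list of passages down to the causal-event candidates."""
    return [p for p in passages if is_causal_passage(p)]
```

A substring match also fires on inflected forms such as "caused" or "leads to" only when the keyword appears verbatim, so a real pipeline would likely need stemming or a broader keyword list.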

(2) Question and Answer Verification
Question verification is based on the following three questions.

Q1: Is this a readable, passage-related question?
Q2: Is the question not well-defined without the Wikipedia passage?
Q3: Is the given answer correct? If not, could you provide the correct answer to the question?

(3) Answer Post-processing
Because questions and answers are written in free form, they go through post-processing such as formalization. For example, aliases such as “USA” and “U.S.A” are unified to “United States of America”, and numbers are unified to words, such as “5” to “five” and “30” to “thirty”.
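A minimal sketch of this kind of answer canonicalization, using only the examples mentioned above. The lookup tables are illustrative and intentionally tiny; the actual rules used for IfQA are not listed in this post.

```python
# Illustrative lookup tables built from the examples in the text.
ALIASES = {
    "usa": "United States of America",
    "u.s.a": "United States of America",
}
NUMBER_WORDS = {"5": "five", "30": "thirty"}

def canonicalize_answer(answer):
    """Map known aliases and digit strings to a single canonical form."""
    cleaned = answer.strip()
    key = cleaned.lower()
    if key in ALIASES:
        return ALIASES[key]
    if key in NUMBER_WORDS:
        return NUMBER_WORDS[key]
    return cleaned  # unknown answers pass through unchanged
```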

2.2. Dataset Analysis

Answer Type and Length
The IfQA benchmark has four answer types: entity (49.7%), date (14.5%), number (15.9%), and others (19.9%). The table below shows examples. Answers are short, averaging 1.82 words (shorter than NQ (2.35 words), TriviaQA (2.46 words), and HotpotQA (2.46 words)).

Question Type and Length
Questions fall into 7 types: what (51.7%), who (14.6%), when (5.1%), which (10.1%), where (3.5%), and how many/much (12.0%). The average question length is 23.2 words, longer than existing open-domain QA datasets such as NQ (9.1 words), TriviaQA (13.9 words), and HotpotQA (15.7 words), because the questions include a counterfactual presupposition clause.

Span vs. Non-span Answer
Because most of the evidence is in Wikipedia, 75.1% of the answers are spans in the passage. Non-span answers appear in cases such as mathematical reasoning (the second example in the table above) or when multiple spans in the passage must be combined (the third example in the table above).

2.3. Dataset Splits

The authors release the dataset with two official splits. The first is a conventional split for supervised learning (IfQA-S) (train-dev-test: 2400-700-700). In addition, few-shot settings have become important for measuring what recent LLMs can do, which raises the need for a natural test bed where such models can learn counterfactual presuppositions from few examples. The authors therefore provide another split for few-shot learning (IfQA-F), which assigns only 600 examples to train and 1,600 each to dev and test.

3. Experiments

3.1. Retrieval Corpus

The Wikipedia dump of 2022-05-01 is used, which contains 6,394,390 pages. Following the preprocessing of prior work, such as removing passages of 100 words or fewer, the final corpus has 27,572,699 passages.

3.2. Comparison Systems

Closed-book models: Codex and ChatGPT are used. They encode the given question and answer it without using any external knowledge. The final answer is obtained with Chain-of-Thought (CoT) prompting instead of direct answering.
Open-book models: BM25 and DPR retrievers fetch Wikipedia passages, and the answer is then generated with state-of-the-art readers such as FiD and RAG, using T5.

3.3. Evaluation Metrics

Retrieval performance : Recall@K (R@K)
End-QA performance : EM, F1
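The three metrics can be sketched as follows. This is a simplified version under stated assumptions: normalization only lowercases and collapses whitespace (standard QA evaluation scripts usually also strip punctuation and articles), and Recall@K counts a hit when the gold answer string appears in any of the top-K passages.

```python
from collections import Counter

def normalize(text):
    """Simplified normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def exact_match(prediction, gold):
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Token-level F1 between the prediction and the gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def recall_at_k(ranked_passages, gold, k):
    """R@K: 1.0 if any of the top-k passages contains the gold answer."""
    target = normalize(gold)
    return float(any(target in normalize(p) for p in ranked_passages[:k]))
```

Dataset-level scores are the averages of these per-question values.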

3.4. Implementation Details

※ See the paper.

3.5. Results and Discussion

(1) Retrieval in IfQA is challenging.

Recall@20 is about 60%, so supporting evidence is not retrieved for about 40% of the questions. The IfQA benchmark has a few unique features. First, its questions are longer than in other QA datasets; long questions are good news for keyword-matching retrieval methods such as BM25 but bad news for semantic-matching methods such as DPR. Second, the discrepancy between the counterfactual presupposition and the factual evidence leads semantic matching to poor retrieval results. For example, for the question “If sea levels rose rapidly, which country would be submerged first?”, the retriever should search for “which country lies at the lowest elevation”, but it tends to focus on words such as “sea level”, “rise”, and “submerge”.

(2) Reading and reasoning in IfQA are challenging.

Apart from retrieval, models also struggle with reading. As the right-hand panel of the figure above shows, even state-of-the-art reader models such as FiD struggle, reaching only about 40% accuracy even when the golden passage is given. So although FiD achieves state-of-the-art performance on most open-domain QA benchmarks, its reasoning module performs poorly on IfQA. Performance is even lower (32%) on complex reasoning such as numerical reasoning.

(3) Chain-of-thought improves LLMs’ counterfactual reasoning.

As expected of a method that is strong on complex reasoning tasks, Chain-of-Thought substantially improves the counterfactual reasoning performance of LLMs. However, closed-book LLMs still lack non-parametric knowledge, so even with CoT they fall short of state-of-the-art retrieve-then-generate models such as FiD.

(4) Passage retriever + Large model reasoner performs the best on IfQA.
Finally, using a retriever such as BM25 or DPR and then prompting an LLM (ChatGPT) few-shot with the retrieved passages gave a large improvement, surpassing existing SOTA retrieve-then-generate models such as FiD.

(5) Case Study

Conclusion

We introduce IfQA, a novel dataset with 3,800 questions, each of which is based on a counterfactual presupposition and has an “if” clause. Our empirical analysis reveals that IfQA is challenging for existing open-domain QA methods in both retrieval and reasoning process. It thus forms a valuable resource to push open-domain QA research on both retrieval and counterfactual reasoning fronts.

Limitations

The main limitation of IfQA dataset is that it only covers event-based questions, due to the nature of creating counterfactual presuppositions. Therefore, our dataset is not intended for training general open-domain QA models or evaluate their capabilities. For data collection, we relied heavily on human annotators, both for question annotation and verification. Despite our efforts to mitigate annotator bias by providing explicit instructions and examples and by sampling annotators from diverse populations, it is not possible to completely remove this bias. Besides, we use heuristic rules to select only a small portion of Wikipedia passages and then present them to human annotators, which might lead to pattern-oriented bias in the annotated data.


[EMNLP2023] SMoP: Towards Efficient and Effective Prompt Tuning with Sparse Mixture-of-Prompts

Fri, 08 Mar 2024 06:00:00 +0000

[pdf] [github]

Joon-Young Choi, Junho Kim, Jun-Hyung Park, Wing-Lam Mok, SangKeun Lee
Department of Artificial Intelligence, Korea University, Seoul, Republic of Korea

Abstract

(Inefficiency in prompt tuning) Prompt tuning is an efficient alternative to fine-tuning, but existing prompt tuning methods use 100 or more tokens, which is inefficient.
(SMoP) The authors propose SMoP (Sparse Mixture-of-Prompts), a method that uses short soft prompts. SMoP trains multiple short soft prompts, each specialized in handling a different subset of the data, and selects among them with a gating mechanism.
(Experiment) In the experiments, SMoP outperforms the baseline methods while reducing training and inference costs.

1. Introduction

▶ Prompt tuning
Prompt tuning has recently drawn attention as a parameter-efficient alternative to fine-tuning. It typically freezes the LM parameters and tunes only a soft prompt that is prepended to the model input, which is efficient yet performs strongly. As prompt tuning methods have been proposed, longer and longer prompts have been used to get better performance. Soft prompts longer than 100 tokens are now typically considered beneficial for model performance, but their computational requirements have received little attention.

▶ SMoP : Sparse Mixture-of-Prompts
The authors therefore propose SMoP (Sparse Mixture-of-Prompts), which uses short soft prompts during both training and inference. Inspired by the Sparsely-Gated Mixture-of-Experts (MoE), it uses multiple short soft prompts, each specialized in handling a subset of the data.

In the figure below, conventional prompt tuning with 100 tokens sometimes uses even more training memory than fine-tuning. SMoP does not have this problem and is both efficient and accurate.

In the experiments on the SuperGLUE benchmark with T5-base and T5-large, SMoP outperforms existing prompt tuning methods while being more efficient in training time, memory, and inference computation.

2. Method

2.1. Preliminaries

Full Fine-tuning

Sequence-to-Sequence model : $p_{\phi}(y \mid x)$ parameterized by $\phi$
embedding : $X=\{x_1, x_2, \dots, x_n\} \in \mathbb{R}^{n \times e}$
label : $Y$
objective of full fine-tuning :

Prompt Tuning

soft prompt length : $l$
soft prompt embedding : $P_\theta$
objective of prompt-tuning:

; : concatenation. Prompt tuning is shown in Figure 2 (a) above.
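The two objective formulas referenced in the lists above did not survive in this post. Assuming the standard formulation implied by the definitions given ($p_\phi$, $X$, $Y$, $P_\theta$, and ";" for concatenation), they can be sketched as follows. Treat this as a reconstruction from the surrounding notation, not a quotation from the paper.

```latex
% Full fine-tuning: all model parameters \phi are updated.
\max_{\phi} \; \log p_{\phi}(Y \mid X)

% Prompt tuning: \phi is frozen and only the soft prompt
% P_{\theta} \in \mathbb{R}^{l \times e} is trained.
\max_{\theta} \; \log p_{\phi}\left(Y \mid [P_{\theta}; X]\right)
```

The only difference between the two is which parameters receive gradients: $\phi$ in the first case, $\theta$ in the second.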

2.2. SMoP: Sparse Mixture-of-Prompts

The goal of SMoP is to train multiple short soft prompts, where each prompt is specialized in a subset of the data. To this end, a gating mechanism is introduced, as in Figure 2 (b) above.

The gating mechanism introduces a small linear router model $L_u$, parameterized by $u \in \mathbb{R}^{e \times k}$. The router decides which soft prompt is routed to the input. Given $k$ soft prompt embeddings $P_{\theta_1}, P_{\theta_2}, \dots, P_{\theta_k}$, the router computes the routing probabilities $p_1, p_2, \dots, p_k$ from the average of the input embeddings $X$.

The soft prompt with the highest probability is then routed to the input. The SMoP objective is therefore as follows.

c : index of the prompt with the highest probability value

2.3. Router Perturbation

Just as Mixture-of-Experts (MoE) work improves performance by balancing the load across experts during training, SMoP balances the load across soft prompts. To do so, it introduces router perturbation, which injects Gaussian noise during SMoP training.

A Gaussian perturbation is therefore added to the probability computation above.
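The routing step and its perturbation can be sketched in a few lines. The shapes follow the definitions in Section 2.2; the multiplicative form of the noise, `scores * (1 + delta)`, and the `noise_std` parameter are assumptions, since the perturbed formula is not reproduced in this post.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def route_prompt(X, router_weights, prompts, noise_std=0.0, rng=None):
    """Pick one short soft prompt for an input and prepend it.

    X: (n, e) input embeddings, router_weights: (e, k),
    prompts: (k, l, e) soft prompt embeddings.
    noise_std > 0 enables the (assumed multiplicative) Gaussian perturbation,
    intended for training only.
    """
    x_mean = X.mean(axis=0)                    # average input embedding, shape (e,)
    scores = x_mean @ router_weights           # one router score per prompt, shape (k,)
    if noise_std > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        scores = scores * (1.0 + rng.normal(0.0, noise_std, size=scores.shape))
    probs = softmax(scores)                    # routing probabilities p_1..p_k
    c = int(np.argmax(probs))                  # index of the top-1 prompt
    return c, np.concatenate([prompts[c], X], axis=0)  # [P_c; X]
```

Only the selected prompt of length `l` is prepended, so the model input grows by `l` tokens regardless of how many prompts `k` are trained.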

3. Experiments

3.1. Experimental Settings

Tasks : SuperGLUE
Backbone Models : T5-base, T5-large
Baselines : Prompt tuning (Lester et al.), P-tuning (Liu et al.), full fine-tuning
Eval setup : prompt tuning : length {5,20,50,100} SMoP : {1,3,5,10}
- report training time, memory usage, FLOPs for inference cost

3.2. Results

Main Results

SMoP achieves the highest performance (an improvement of 2.5% on average, and 3.4% on T5-large).
SMoP is also more efficient (14.6% less training time, 22.9% less training memory, and 27.2% fewer inference FLOPs on T5-large).

Length and Number of Soft Prompts

The best performance is obtained with 4 soft prompts (k=4) of length 5 (l=5).
Performance degrades when the prompt is too long (50 or more), and using 20 or more prompts does not help.

Routing Methods

Among the routing methods compared, the one SMoP uses (top-1 with Gaussian perturbation) performs best.

Conclusion

We have presented SMoP (Sparse Mixture-of-Prompts), a novel prompt tuning method that utilizes short soft prompts for efficient training and inference while maintaining performance gains associated with increased prompt length. To achieve this, we have employed a gating mechanism in SMoP that routes each instance to one of the multiple short soft prompts. Experimental results have demonstrated that SMoP has outperformed prompt tuning while reducing training and inference costs through the utilization of short soft prompts.

Limitations

Given the same total prompt length, the gating mechanism of SMoP introduces additional parameters compared to prompt tuning, inducing additional storage requirements. Comparing prompt tuning with a soft prompt of length 20 (20,480 trainable parameters) and SMoP with 4 prompts of length 5 (24,576 trainable parameters) on T5-base, SMoP adds 20% trainable parameters and such difference increases as more prompts are utilized. We further note that SMoP is orthogonal to most of the existing prompt tuning methods including prompt transfer learning methods (Vu et al., 2022; Asai et al., 2022; Wang et al., 2023) as mentioned in Section 4. While our investigation has highlighted the significance of incorporating short soft prompts through sparse activation in conventional single-task prompt tuning, we believe that SMoP holds promise as a valuable direction for augmenting the efficiency of prompt tuning methods in the future.

[EMNLP2023] Ditto: A Simple and Efficient Approach to Improve Sentence Embeddings

Wed, 06 Mar 2024 08:00:00 +0000

Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Chong Deng, Hai Yu, Jiaqing Liu, Yukun Ma, Chong Zhang
Speech Lab, Alibaba Group

[pdf]

Abstract

(Anisotropy bias in BERT sentence embedding) The authors point out that BERT sentence embeddings suffer from an anisotropy bias towards uninformative words, which makes semantic textual similarity (STS) tasks difficult.
(Ditto) To address this, the authors propose Diagonal Attention Pooling (Ditto), an unsupervised approach. It weights each word with a model-based importance estimate and computes the sentence embedding as the weighted average of the word representations. Ditto can be applied to any PLM, not only BERT.
(No extra parameters) Unlike other sentence embedding methods, Ditto requires no additional parameters.
(Experiment) Unlike BERT, Ditto does not suffer from the anisotropy bias and therefore performs well on STS tasks.

1. Introduction

▶ Bias in BERT sentence embedding
PLMs such as BERT, RoBERTa, and ELECTRA do perform very well, but several studies have argued that BERT sentence embeddings are worse even than GloVe. In particular, a severe anisotropy bias has been reported: sentence embeddings produced by the original BERT show high similarity for any pair of sentences. This is a problem when using BERT sentence embeddings for Semantic Textual Similarity (STS) tasks.

▶ Improving sentence embeddings from PLMs
Methods for improving PLM sentence embeddings fall into three broad categories.

(1) learning-free method

These methods attribute the anisotropy bias to static token embeddings, which are affected by factors such as token frequency. The Static Remove Biases Avg. method addresses it by removing the top-frequency tokens and averaging the remaining tokens to form the embedding.

Because this method does not use BERT's contextualized representations, it represents short sentences poorly, since they may contain few informative words.

There is also a prompt-based learning-free method, which fills in the MASK token in the prompt “This sentence: [original sentence] means [MASK]”. Its drawbacks are that the longer input increases cost, that it cannot be applied to models that do not use a MASK token, such as ELECTRA, and that it depends heavily on the prompt, which reduces reliability.

(2) extra-learning method

The second category keeps the PLM parameters fixed and adds extra learning. A representative example is BERT-flow, which introduces a flow-based generative model to address BERT's anisotropy problem by transforming the BERT sentence embedding distribution into a smooth and isotropic Gaussian distribution.

(3) updates parameter

마지막은 BERT 를 포함한 PLM 의 param 을 update 하는 방법이다. 특히, NLI 와 STS dataset 을 통한 추가학습으로 이것들을 잘하게끔 sentence embedding 을 유도 학습하는 방법이다. SimCSE 등이 대푲거인 방법이다.

이 논문에서는 위의 방법들과는 다른 새로운 learning-free method 인 Ditto 를 소개한다.

2. Analyze BERT Sentence Embeddings

▶ Observation 1: The compositionality of informative words is crucial for high-quality sentence embeddings.
Perturbed masking masks two tokens in a sentence and analyzes how each token affects the other. The paper applies this analysis to BERT and SBERT, as shown in the figure below.

For SBERT, prominent vertical lines appear at informative words such as “social media” and “Capitol Hill”. Since this pattern is not observed for BERT, the authors hypothesize that informative tokens are a strong indicator of high-quality sentence embeddings.

A similar trend appears when word importance is measured with TF-IDF. SBERT's impact matrix shows a higher correlation with TF-IDF. ELECTRA has a low correlation and also performs very poorly on STS tasks. The authors therefore point out that the problem is that BERT and ELECTRA are biased toward uninformative words.

▶ Observation 2: Certain self-attention heads of BERT correspond to word importance.

In the table above, BERT shows a decent correlation with TF-IDF, unlike ELECTRA. The authors therefore suggest that informative words may well be encoded in BERT but are not expressed in the resulting sentence embedding.

As the figure above shows, an analysis of BERT confirms that certain heads assign high “diagonal values” to informative words.

3. Diagonal Attention Pooling

Based on the two observations above, the authors propose Diagonal Attention Pooling (Ditto). As in the figure above, instead of the conventional approach of averaging BERT's last-layer hidden states, the sentence embedding uses only the first hidden layer or the average of the first and last hidden layers. Ditto then weights the hidden states by the diagonal attention values of a specific head to build the sentence embedding. Ditto is therefore an efficient learning-free method that produces sentence embeddings without any additional training.
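The pooling step can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the per-token hidden states and the attention matrix of one chosen head are assumed to be available as plain Python lists, the name `ditto_pool` is hypothetical, and normalizing the diagonal weights so they sum to one is a choice made for this sketch.

```python
def ditto_pool(hidden_states, attention):
    """Weight token hidden states by one head's diagonal attention values.

    hidden_states: list of n token vectors, each of length d
    attention:     n x n attention matrix of a single head
    Returns a sentence embedding of length d.
    """
    n = len(hidden_states)
    dim = len(hidden_states[0])
    # Diagonal entry A[i][i] is used as the importance of token i.
    weights = [attention[i][i] for i in range(n)]
    total = sum(weights)
    # Assumption of this sketch: normalize the weights to sum to one.
    weights = [w / total for w in weights]
    return [
        sum(weights[i] * hidden_states[i][k] for i in range(n))
        for k in range(dim)
    ]


# Toy usage: the first token has a larger diagonal value, so it dominates.
embedding = ditto_pool(
    hidden_states=[[1.0, 0.0], [0.0, 1.0]],
    attention=[[0.75, 0.25], [0.75, 0.25]],
)
```

No parameters are learned here, which matches the learning-free claim: the weights come directly from the attention values the model already computes.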

4. Experiments and Analysis

Ditto is a very efficient sentence embedding method that also performs well.

Comparison of Ditto performance by attention head.

Cosine similarity of Ditto and learning-free baselines.

Conclusion

We propose a simple and learning-free Diagonal Attention Pooling (Ditto) approach to address the bias towards uninformative words in BERT sentence embeddings. Ditto weights words with model-based importance estimations and can be easily applied to various PLMs. Experiments show that Ditto alleviates the anisotropy problem and improves strong sentence embedding baselines.

[EMNLP2023] PK-ICR: Persona-Knowledge Interactive Multi-Context Retrieval for Grounded Dialogue

Mon, 04 Mar 2024 12:48:00 +0000

[pdf] [github]

Minsik Oh, Joosung Lee, Jiwei Li, Guoyin Wang

Abstract

(PK-ICR) The paper proposes Persona and Knowledge Dual Context Identification, a new task in which persona and knowledge must be identified jointly, and PK-ICR as the retrieval method for it.
(Grounding Retrieval Method) It proposes a new grounding retrieval method that can use all contexts in the dialogue. The method is more efficient than existing QA retrieval models.
(Null-positive rank test) In addition, it proposes the null-positive rank test, which measures ranking performance on semantically dissimilar samples.

1. Introduction

▶ PK-ICR: Persona-Knowledge Interactive Multi-Context Retrieval for Grounded Dialogue
In prior dialogue research, work on persona and work on knowledge grounding were mostly carried out independently. This paper newly proposes the Persona-Knowledge Dual Context Identification task, in which the two must be handled jointly. As in the figure above, it deals with the interactions among persona, knowledge, and dialogue.

▶ Contributions
The paper makes three contributions.

Persona and knowledge dual context retrieval methodology.
Framework for cross-task adaptation of dialogue context interactions.
Evaluating the hard-negative trait of Persona-augmented Dialogue

2. Methodology

The goal of this task is to maximize the interaction among all components of a conversation turn.

2.1. Knowledge Retrieval

As in the figure above, knowledge retrieval is cast as answering the question formed by {Persona} {Dialogue} with {Knowledge}.

In the equation above, E is the input, Q and A are the QA candidates, P and K are a persona and knowledge pair, and D is the dialogue.

Using this input, the best knowledge is selected over all pairs (i, j).

2.2. Persona Retrieval

The QA retrieval model is fine-tuned with the augmented persona.

In Eq. (4), the augmented input E’ is built from the result of Section 2.1 and is used to fine-tune the model M.

Then, as shown below, the fine-tuned model Mf is used to infer the persona likelihood score.

Finally, the retrieved grounding information is summarized as follows.

2.3. Null-positive Rank Test

This test evaluates solely the discriminative performance on samples, independent of the score output. Persona-augmented Dialogue is treated as a form of hard-negative sampling, and the test asks the following question.

Can the model rank null-positive sample correctly in relation to non-trivially dissimilar augmented samples?

3. Experiment Setup

The dataset is Call For Customized Conversation.
For models, several QA models trained on the MS MARCO dataset are used.

4. Results

4.1. Knowledge Retrieval

Table 1 shows strong performance increase for our prompt input from dialogue-only model, confirming that all components of dialogue is important.

4.2. Persona Retrieval

Table 2 shows that fine-tuned Pi + D model has the best performance.
The non-fine-tuned Pi + D model performs poorly, because the true knowledge is affected by the likelihood score.

4.3. Null-positive Rank Test

The performance of the model has increased in top-1 rank setting (0 threshold, 0-Acc) and all variants of non-triviality have improved for both models.

Conclusion

We introduce persona-knowledge dual context retrieval method PK-ICR in this paper. We perform QA-informed prompt-augmentations of data that successfully exploit the interactions between multiple dialogue components. We perform zero-shot top-1 knowledge retrieval and precise persona scoring. We present a novel evaluation method of null-positive rank test as to isolate the hard-negative effect of Persona-augmented Dialogue. We obtain SOTA results on both retrieval tasks of the Call For Customized Conversation benchmark and report the alignment of the non-triviality metric with threshold-free performance. With our research, we hope to stimulate readers to model dialogue context as an interactive whole of multiple components.

Limitations

Our cross-task adaptation of dialogue grounding retrieval to QA task is limited in terms of the target task and our prompt construction. In addition, retrieval models informed by inductive bias for multi-context scenarios could further improve our methodology. We specifically study multi-context interactions and retrieval in dialogues, which is a relevant and novel problem for advancing broadly capable dialogue systems. As an extension to our research, future work could also report on modeling downstream generation tasks based on grounding interactions.

[ACL2023] A Synthetic Data Generation Framework for Grounded Dialogues

Wed, 28 Feb 2024 12:00:00 +0000

[pdf] [github]

Jianzhu Bao^1,5, Rui Wang^1,6, Yasheng Wang³, Aixin Sun², Yitong Li^3,4, Fei Mi³, Ruifeng Xu^1,5,6
¹ Harbin Institute of Technology, Shenzhen, China ² Nanyang Technological University, Singapore ³ Huawei Noah’s Ark Lab, Huawei Technologies Co., Ltd. ⁴ Peng Cheng Laboratory, Shenzhen, China ⁵ Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies

Abstract

(Motivation) As with other kinds of dialogue, collecting grounded-dialogue data is very costly.
(SynDG) The paper proposes SynDG, a synthetic data generation framework that uses freely available knowledge data such as Wikipedia and persona profiles together with a pre-trained language model.
(Dialog Flow) The key idea of SynDG is to maintain coherence through a dialogue flow.
(Two-level filtering strategy) To ensure the coherence of the synthetic dialogues and the dialogue flows, it proposes a two-level filtering strategy (flow-level and utterance-level).
(Experiment) SynDG boosts model performance in both the full-training-data and low-resource scenarios.

Introduction

This paper deals with grounded dialogue systems, which provide responses that are relevant to the knowledge and informative. Like other kinds of dialogue, grounded dialogue suffers from a shortage of datasets. Other methods have been proposed (for example, using RL or user simulation), but they do not take the dialogue flow into account.

A dialogue flow can be seen as the outline of a dialogue. It can capture the content and the trajectory (topic shifts and so on) of each session. In the figure above, the dialogue flows naturally from “husky” to “sled dogs” and then to “huskies as pets”. If one of these is replaced with a different knowledge piece such as “Esquimaux”, which appears on the same Wikipedia page as husky, the flow becomes inconsistent. The most important point is therefore to design the dialogue flow carefully so that coherence and smoothness are secured.

To this end, the paper proposes Synthetic Dialog Generation (SynDG). The generated dialogues can be used as auxiliary training data. SynDG builds dialogue flows from Wikipedia and persona knowledge using heuristics and generates the dialogues with T5. It then performs quality assurance through two-level filtering at the flow level and the utterance level. In the experiments, using the generated data for the two datasets as additional training data led to better performance.

Task Formulation

Given the training grounded-dialogue set $D^t = \{(C^t_i, K^t_i, r^t_i)\}^{N_t}_{i=1}$, where $C$ is the dialogue context, $K$ is the knowledge, and $r$ is the response, the goal is to learn $P(r|C,K)$ from $D^t$. Synthetic data $D^s$ is then generated, and the authors check whether training on $D^t \cup D^s$ improves the generation model.

Methodology

The figure above shows the overall framework of SynDG. It consists of three parts: (1) dialogue flow construction through task-specific heuristics, (2) utterance realization based on the dialogue flow, and (3) quality assurance through two-level filtering.

1. Dialogue Flow Construction
A dialogue flow $F=(f_1,f_2,…,f_{n_f})$ is a sequence of knowledge pieces $f$. Each knowledge piece $f$ is either a single piece from the knowledge base $K$, a sequence of several pieces, or “[none]”, meaning that there is no knowledge. Each knowledge piece corresponds to one utterance: odd positions are utterances of the first speaker and even positions are utterances of the second speaker. During training, every utterance comes with its knowledge piece, so the flow $F$ is easy to obtain. The important question is how to obtain the dialogue flow at inference time, and the paper uses heuristics for this. For PersonaChat, the persona sentences are collected into the knowledge base $K$, and zero, one, or more persona sentences are used as each $f$. For WoW (Wizard of Wikipedia), the chosen passage and the passages retrieved at the first turn form the knowledge corpus $K$, and at least one $f$ is extracted for each turn.

The authors acknowledge that the heuristic approach is not universally applicable, but they argue that it can be applied to any dataset with minor modifications.

2. Dialogue Content Realization
T5 is fine-tuned to generate utterances from the dialogue flow. To generate $u_i$, the input is $(u_1, u_2, …, u_{i-1},[t],f_i,[/t],f_{i+1},…,f_{i+m})$. $[t]$ and $[/t]$ emphasize that $u_i$ is generated from $f_i$. In practice, the special tokens [user] and [agent] are also added.
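The input layout described above can be sketched as a small string-building helper. This is a hedged sketch, not the authors' preprocessing code: the function name `build_generator_input`, the space separator, the default look-ahead `m`, and the mapping of odd turns to `[user]` and even turns to `[agent]` are all assumptions made for illustration.

```python
def build_generator_input(history, flow, i, m=2):
    """Build the text input used to generate the i-th utterance (1-indexed).

    history: previous utterances u_1 .. u_{i-1}
    flow:    knowledge pieces f_1 .. f_n of the dialogue flow
    m:       number of future flow pieces appended as look-ahead
    """
    parts = []
    for turn, utterance in enumerate(history, start=1):
        # Assumption: odd turns are the user, even turns are the agent.
        speaker = "[user]" if turn % 2 == 1 else "[agent]"
        parts.append(f"{speaker} {utterance}")
    # [t] ... [/t] marks the piece that the target utterance is grounded on.
    parts.append(f"[t] {flow[i - 1]} [/t]")
    # Future pieces f_{i+1} .. f_{i+m} give the generator a view of the flow.
    parts.extend(flow[i : i + m])
    return " ".join(parts)


example = build_generator_input(
    history=["hi"],
    flow=["[none]", "likes huskies", "sled dogs", "pets"],
    i=2,
    m=1,
)
```

Showing the generator the upcoming pieces is what lets the current utterance lead naturally into the next topic of the flow.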

3. Two-level Filtering
The filters are trained with T5's text-infilling task (masked sentence modeling). Much like dialogue reconstruction, utterances and flow pieces in the training set are masked and the T5-based filter is trained to recover them. At inference time, the log probability produced by the filter is used as the score. This is applied to both the utterances and the flow, and the two scores are summed to give the final score.
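The score combination can be illustrated with a short sketch. It assumes the two log-probability scores have already been computed by the filters; the dictionary keys, the function name `filter_dialogues`, and the rule of keeping a fixed top fraction are hypothetical choices for this example, since the post does not state the selection rule.

```python
def filter_dialogues(candidates, keep_ratio=0.5):
    """Rank synthetic dialogues by flow-level plus utterance-level score.

    candidates: list of dicts with 'flow_logprob' and 'utt_logprob'
    keep_ratio: fraction of candidates to keep (assumption of this sketch)
    """
    ranked = sorted(
        candidates,
        key=lambda c: c["flow_logprob"] + c["utt_logprob"],
        reverse=True,  # higher (less negative) log probability is better
    )
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]


kept = filter_dialogues([
    {"id": 1, "flow_logprob": -1.0, "utt_logprob": -2.0},
    {"id": 2, "flow_logprob": -0.5, "utt_logprob": -0.5},
    {"id": 3, "flow_logprob": -3.0, "utt_logprob": -3.0},
    {"id": 4, "flow_logprob": -0.2, "utt_logprob": -4.0},
])
```

Dialogue 4 has the best flow score but is dropped because of its utterance score, which is the point of combining the two levels.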

Experiment Settings

[Dataset] PersonaChat, WoW

[Baseline] WoW : BlenderBot. KA (Knowledge Available) generates the response from the ground-truth knowledge, while KU (Knowledge Unavailable) starts from knowledge selection. Knowledge selection uses a fine-tuned RoBERTa.

PersonaChat : GPT-2 based baselines. (1) GPT-2 : plain GPT-2. (2) GPT-2-BT : GPT-2 with the back-translation dialogue data augmentation presented in Cao et al. (3) GPT-2-$D^3$ : $D^3$ is the data augmentation method for PersonaChat presented in Cao et al.

[Eval Metrics] BLEU-4, ROUGE-L, PPL (Perplexity), F1 (only for WoW), KF1 (knowledge uni-gram overlapping), ACC (Knowledge selection for KU setting), Human Evaluation - (1) Human Likeness, (2) Informativeness

[Implementation Details] The dialogue generator and the filters use T5-large. (T5-base also improves performance, but reportedly by a smaller margin.)

Experiment Results

Automatic eval results on WoW

SynDG improves not only response generation but also the ability to ground on knowledge, and both levels of filtering contribute to the improvement. Removing both filters (w/o FF&UF) degrades performance far more than removing either one (w/o FF or w/o UF), and each single ablation also degrades it. BB-SynDG w/o FF&UF still beats random sampling (RS), which shows that the heuristics help.

The effect is even clearer in the low-resource setting. In the KA setting in particular, BB-SynDG with only 1/16 of the training data reaches performance comparable to BB trained on the full training data, which confirms that it helps with the low-resource problem.

Automatic eval results on PersonaChat

SynDG also gives good results on PersonaChat. GPT-2-$D^3$ applies extensive and very elaborate augmentation, but SynDG still performs better.

Human Evaluation

Impact of the Number of Synthetic Dialogues

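These are the results of the experiment that answers the question "How many synthetic dialogues are appropriate to integrate as extra training samples?" BLEU-4 (a) and ROUGE-L (b) keep improving, whereas the KF-1 score (c) rises rapidly at first and then stays stable. The authors therefore think that there is a limit that depends on the scale of the LM, that the effect of augmentation is not unlimited, and that about twice the size of the original data seems to be the most cost-effective amount.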

Conclusion

In this paper, we propose a framework, SynDG, to automatically construct synthetic training data for the grounded dialogue task. We first construct dialogue flows based on unstructured knowledge, then transform them into synthetic dialogues by large LMs, and finally filter and retain the generated dialogues with high quality. The experimental results demonstrate the effectiveness of our proposed framework in both full training data and low-resource scenarios. Further analysis shows that the model performance tends to increase as the number of synthetic dialogues increases. For future work, we plan to investigate more efficient strategies for determining dialogue flows and take larger LMs to produce synthetic dialogues with higher quality.

Limitation

There is reportedly still a large quality gap between the synthetic data produced by SynDG and human-written dialogues. The authors say this could be improved by using a larger LM or by introducing knowledge graphs or reasoning skills.

[EMNLP2023] CLAIR: Evaluating Image Captions with Large Language Models

Mon, 26 Feb 2024 08:00:00 +0000

[pdf] [github]

David M. Chan, Suzanne Petryk, Joseph E. Gonzalez, Trevor Darrell, John Canny
University of California, Berkeley

Abstract

(Image Captioning Metric) A metric for evaluating image captioning models has to consider factors such as semantic relevance, visual structure, object interactions, caption diversity, and specificity.
( CLAIR ) This paper presents a new image captioning metric that leverages the zero-shot capability of Large Language Models (LLMs).
(Experiment) CLAIR shows high human correlation: 39.6% higher than SPICE and 18.3% higher than RefCLIP-S.

1. Introduction & Background

▶ Image Captioning Metric
Evaluating image captioning models is challenging because a metric has to consider factors such as semantic relevance, visual structure, object interactions, caption diversity, and specificity. N-gram based metrics such as BLEU, CIDEr, and SPICE were proposed first, followed by model-based metrics such as CLIPScore, TIFA, SeeTrue, VPEval, UMIC by Dr. Hwanhee Lee, and my own work PR-MCS. However, existing metrics either show low human correlation or are too costly to use as a metric.

▶ CLAIR
Large Language Models (LLMs) have recently emerged and show very strong performance. This paper leverages the strong “judge” ability of LLMs and proposes CLAIR (Criterion using LAnguage models for Image caption Rating). It simply has an LLM generate a numeric rating for the captions. The authors claim this is the first work to have an LLM directly measure semantic text quality.

In experiments on representative image captioning benchmarks such as MS-COCO, Flickr8k, and PASCAL-50S, CLAIR shows remarkably strong human correlation. The authors also show experimentally that CLAIR_E, an ensemble variant, is an even stronger metric. The contributions of the paper are as follows.

(1) It presents a metric that can evaluate a vision-language task with a language-only model.
(2) It shows that LLMs can go beyond simple scalar ratings and produce ratings grounded in reasoning.
(3) It shows that LLMs can reflect most of the criteria used for evaluating image captions.

2. CLAIR: LLMs for Caption Evaluation

Because CLAIR uses a text-only LLM as an image captioning metric, it recasts evaluation as a human-readable text completion task. Using the prompt shown in the figure above, the LLM outputs a score as a text completion, with the temperature set to 0 (greedy) to ensure reproducibility. Also for reproducibility, the experiments are run zero-shot with the API's default inference parameters.
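The text-completion framing can be sketched as two small helpers: one that builds a prompt and one that parses the completion. This is only an illustration under stated assumptions. The prompt wording, the JSON fields `score` and `reason`, the 0 to 100 scale, and the function names are hypothetical and are not the prompt from the paper's figure. No LLM API is called here; the completion string would come from whichever model is used.

```python
import json


def build_prompt(candidates, references):
    """Build a hypothetical rating prompt from candidate and reference captions."""
    cand = "\n".join(f"- {c}" for c in candidates)
    refs = "\n".join(f"- {r}" for r in references)
    return (
        "You are rating image captions.\n"
        f"Candidate captions:\n{cand}\n"
        f"Reference captions:\n{refs}\n"
        "On a scale of 0 to 100, how likely is it that the candidates "
        "describe the same image as the references? "
        'Answer in JSON: {"score": <int>, "reason": "<text>"}'
    )


def parse_rating(completion):
    """Return the score scaled to [0, 1], or None if the output is unusable."""
    try:
        data = json.loads(completion)
        score = float(data["score"])
    except (ValueError, KeyError, TypeError):
        # Refusals and malformed JSON end up here.
        return None
    if not 0 <= score <= 100:
        return None
    return score / 100.0


prompt = build_prompt(["a dog runs on grass"], ["a dog running in a field"])
```

Returning `None` instead of raising makes it easy to retry or skip samples when the model refuses or emits malformed JSON.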

The backbone models are GPT-3.5 (ChatGPT), Claude, and PaLM. Open-source models such as Koala and Vicuna were also tried, but they reportedly showed very poor human correlation. The baseline metrics compared are BLEU, ROUGE, METEOR, CIDEr, and CLIP-Score.

3. Evaluation & Discussion

Some qualitative results are shown in the figure below. CLAIR not only shows high human correlation but also provides the reasoning behind each score.

3.1. Sample-level human correlation

To show the strength of CLAIR at the sample level first, experiments were run on COMPOSITE, Flickr8K-Expert, and MS-COCO, with the results below.

(1) CLAIR substantially outperforms not only n-gram based metrics but also CLIPScore.
(2) CLAIR_E differs from inter-human agreement by only about 0.1.

3.2. System-level human correlation

To show the strength of CLAIR at the system level, the authors measure and compare the correlation between the outputs of five models and human evaluation.

3.3 Decision making

3.4. Groups of Captions

4. Limitations

CLAIR is clearly a metric with high human correlation, but it has the following limitations.

Non-Determinism and Parsing Errors : Because the output comes from an LLM, it sometimes answers with something like “As an AI language model, I cannot see, and thus, cannot determine if the image captions match the references”, or produces malformed JSON output.
Increased Cost : It is very expensive. For MS-COCO, about 226 tokens are used on average, and with GPT-4 a single sample costs $0.0067.
Hallucination : The reasoning can contain hallucinations. This is an expected issue, since it is generated by an LLM.

Conclusion

This work introduces CLAIR, an LLM-based evaluation measure for image captioning. CLAIR’s superior performance compared to highly-engineered measures indicates a remarkable fact: LLMs are well aligned with human judgments of caption quality, even more so than some measures designed specifically for semantic similarity. CLAIR is only a glimpse into how LLMs can be used for evaluation tasks, and image captioning is only the beginning. We hope that our work will inspire further exploration of similar measures in other vision and language domains, such as visual storytelling (Huang et al., 2016), where human evaluation of generated text remains a challenging task.

[EMNLP2023] TaskDiff: A Similarity Metric for Task-Oriented Conversations

Fri, 23 Feb 2024 14:00:00 +0000

[pdf] [github]

Ankita Bhaumik^†, Praveen Venkateswaran^∗, Yara Rizk^∗, Vatche Isahagian^∗
^† Rensselaer Polytechnic Institute, Troy, New York ^∗ IBM Research

Abstract

(TOD metrics) Many similarity metrics have been proposed, but little work has studied metrics that capture the unique properties of task-oriented conversations.
( TaskDiff ) The authors therefore propose TaskDiff, a conversational similarity metric. TaskDiff represents dialogue components such as utterances, intents, and slots as distributions and computes the similarity with optimal transport.
(Experiments) TaskDiff shows superior performance and robustness on various benchmarks.

1. Introduction

▶ A key aspect of conversational analytics
Research on conversation has accelerated with the arrival of LLMs, and many assistants built on ChatGPT, LLaMA2, and similar models have appeared. These can improve the user experience. However, metrics for measuring which of several assistants is better have not been studied enough.

▶ Textual similarity of Dialogue
Similarity measurement for textual sources such as documents, social media, and transcripts has already been studied extensively and works quite well. Such work includes Word2Vec, GloVe, and Universal Sentence Encoder.

However, applying existing metrics to task-oriented conversations raises several challenges. First, a TOD contains distinct components (e.g. intents, slots, utterances), which can affect similarity and overlap. For example, two users may have different objectives (e.g. booking travel vs. product returns) yet share the same intent, or they may ask for different slot information. Second, information is provided across multiple conversation turns, which makes measurement harder. Finally, the same set of tasks can be expressed with many different user utterances, which vary in choice of phrasing, order of sentences, use of colloquialisms, and so on.

Therefore, relying on distance-based similarity of utterance embeddings leads to very poor performance.

▶ TaskDiff
The authors therefore propose TaskDiff, a novel similarity metric designed for TOD. As in the figure above, different users may hold the same conversation in different surface forms, for instance with re-ordered tasks or paraphrased utterances, and existing methods (SBERT, ConvED, HOTT, etc.) turn out to be wrong or not robust in such cases.

An ideal metric for conversational similarity must capture the overall goal of the conversation. In the figure above, all three conversations share the same overall goal. TaskDiff represents a conversation as distributions and then measures similarity with optimal transport. Evaluated on several benchmarks, TaskDiff shows high performance and strong robustness.

2. Task-Oriented Conversation Similarity

2.1. Definitions

Pre-defined user intents $I$
corresponding slots or parameters $S$
Conversation $C$
multi-turn sequence of utterances $U$
Overall component of task-oriented conversations $K=[U,I,S]$

2.2. Approach

TaskDiff is defined as a distance between component-wise distributions. Each component $k \in K$ is represented as a distribution, and the distance is computed as the cumulative cost of transforming or transporting the component-wise distributions, that is, with optimal transport.

Figure 2 gives an overview. First, slot values are masked with slot placeholder tokens. This prevents lexical similarity between unrelated utterances from interfering. For example, "I want a ticket to the BIG APPLE" and "I want a ticket to the APPLE CONFERENCE" mean different things, but their lexical similarity can be high because of the word APPLE. Masking each of them with its own slot placeholder prevents this.
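The masking step can be sketched with simple string replacement. This is a minimal sketch, not the authors' code: the `<slot_name>` placeholder format, the slot names `city` and `event`, and the function name `mask_slots` are hypothetical, and the slot annotations are assumed to be given.

```python
def mask_slots(utterance, slot_values):
    """Replace each annotated slot value with a placeholder for its slot name.

    slot_values: mapping from slot name to the value found in the utterance
    """
    masked = utterance
    for slot_name, value in slot_values.items():
        masked = masked.replace(value, f"<{slot_name}>")
    return masked


a = mask_slots("I want a ticket to the BIG APPLE", {"city": "BIG APPLE"})
b = mask_slots(
    "I want a ticket to the APPLE CONFERENCE", {"event": "APPLE CONFERENCE"}
)
```

After masking, the two utterances no longer share the word APPLE, so an utterance encoder sees the slot type rather than the overlapping surface string.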

Next, SBERT is used to obtain conversational embeddings. After the intent distribution and the slot distribution are also obtained, the conversation is represented by the distributions of these components.

The distance between two conversations is the 1-Wasserstein optimal transport distance between their distributions, computed with a cost matrix.
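For the categorical components (intents and slots), a special case makes the optimal transport distance easy to compute by hand. The sketch below assumes a unit cost between any two different labels, which is an assumption of this example and may differ from the paper's cost matrix. Under that cost, the 1-Wasserstein distance between two categorical distributions equals the total variation distance, i.e. half of the L1 difference. The function names are hypothetical.

```python
from collections import Counter


def distribution(labels):
    """Empirical distribution of the labels observed in one conversation."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def ot_distance_unit_cost(p, q):
    """1-Wasserstein distance under a 0/1 cost (equals total variation)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in labels)


# Same intents in a different order give identical distributions.
conv_a = distribution(["book_flight", "pay"])
conv_b = distribution(["pay", "book_flight"])
conv_c = distribution(["return_item"])
```

Because the distributions ignore turn order, re-ordering the tasks of a conversation leaves this distance at zero, while conversations with disjoint intents are at the maximum distance of one.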

See the paper for the detailed notation.

3. Experimental Evaluation

3.1. Dataset

SGD / 20 domain / 20,000 conversations

3.2. Baselines

SBERT : cos-sim based similarity metric
Conversational Edit Distance (ConvED)
Hierarchical Optimal Transport (HOTT) : Latent Dirichlet Allocation (LDA)-based similarity metric

3.3 k-NN Classification

The baseline metrics and TaskDiff are compared through k-NN classification. The experiment checks how well each metric classifies similar SGD conversations with a k-NN classifier. The results are shown below.

TaskDiff represents similarity far better than the baselines.

3.4. Conversational Clusters

When SGD conversations are grouped with k-means clustering, TaskDiff yields the most well-formed and distinct clusters.

3.5. Robustness to Reordering

The distance does not increase under conversational reordering, which shows strong robustness.

Conclusion

In this paper we present TaskDiff, a novel metric to measure the similarity between task-oriented conversations. It not only captures semantic similarity between the utterances but also utilizes dialog specific features like intents and slots to identify the overall objective of the conversations. We demonstrate that unlike existing metrics, taking advantage of these unique components is critical and results in significantly improved performance. As part of future work, we will investigate the inclusion of additional dialog features on open domain dialog datasets and the utilization of TaskDiff to improve the performance of various downstream conversational tasks.

Limitations

We demonstrate in this work that TaskDiff is a superior and more robust similarity metric compared to existing state-of-the-art approaches for task-oriented conversations. Given the use of optimal transport to compute similarity as a function of differences over the component distributions (intents, slots, and utterances), TaskDiff is reliant on being given an ontology for the intents and slots present across the conversations. However, this is a fair assumption to make for the domain of task-oriented conversations, and such ontologies are leveraged by real-world deployments such as Google DialogFlow, IBM Watson Assistant, Amazon Lex, etc.

[EMNLP 2023] Copyright Violations and Large Language Models

Wed, 21 Feb 2024 11:27:00 +0000

[pdf] [github]

Antonia Karamolegkou^1*, Jiaang Li^1*, Li Zhou¹², Anders Søgaard¹
¹ Department of Computer Science, University of Copenhagen ² University of Electronic Science and Technology of China

Abstract

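(verbatim memorization) Language models may memorize more than just facts, including entire chunks of text seen during training.
(Copyrighted text and LLMs) This study explores copyright infringement by LLMs on copyrighted text through the lens of verbatim memorization, focusing on the redistribution of copyrighted text.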

Introduction

▶ Verbatim memorization
When you talk about “Pride and Prejudice” or write something about it, you are unlikely to reproduce exactly the same text and infringe its copyright. But what happens if you ask an LLM?

If ChatGPT is asked to print the first 50 lines of the Bible, it recites the passage perfectly, having memorized its training data. LLMs memorizing their training data is by now nothing new.

There has been prior work probing language models' memorization of copyrighted books ([1]). However, that work was limited to a cloze-style setting and did not cover ad verbatim (word-for-word) memorization. This study probes whether LLMs reproduce sentences of copyrighted text verbatim. Can LLMs actually comply with the laws governing copyrighted text?

The study runs probing experiments on best-selling books and LeetCode. The results confirm that reproduction that may infringe copyright can occur not only for copyrighted books but also for other copyrighted text and code such as LeetCode.

Copyright Laws

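Copyright laws and conventions grant creators the exclusive right to use and distribute their works, with certain exceptions (see, for example, the Universal Copyright Convention of 6 September 1952, the Berne Convention, US Copyright Law §106, and Directive (EU) 2019/790 of the European Parliament on copyright and related rights in the Digital Single Market, amending Directives 96/9/EC and 2001/29/EC). Under US Copyright Law §107, fair use is an exception that is not regarded as infringement, for example when libraries or archives distribute literary works without any purpose of direct or indirect commercial advantage, but this is limited to three copies. This means that LLM providers would have to argue whether it is fair for LLMs to quote passages from famous literary works.

In the European context, quotation is listed as one of the exceptions and limitations to copyright under the copyright and related rights provisions of the Information Society Directive 2001/29/EC. The directive allows member states to provide exceptions to copyright law for quotations for purposes such as criticism or review, provided that they relate to a work or other subject matter that has already been lawfully made available to the public, that the source, including the author's name, is indicated unless this is impossible, and that their use accords with fair practice and stays within the extent required by the specific purpose. Language models that generate full citations could therefore be good practice for avoiding copyright violations. However, there are also situations in which quoting more than 300 words verbatim may lead a court to rule against fair use. Thus, even when an LM distributes small pieces of text as mere quotations and provides citations, it may still violate copyright law.

Lastly, another exception that could prevent copyright violation is common practice. For book-length material, for example, some say 300 words is common practice, while others argue for anything from 25 to 1,000 words. For chapters, magazines, journals, and teaching material, 50 words is common. Since the authors were interested in books and teaching materials (LeetCode problem descriptions), they set 50 words as the baseline.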

Experiments

As mentioned above, the experiments check whether LLMs cause copyright infringement problems on copyrighted books and LeetCode. Prefix probing is used for open-source models, and direct probing is used for closed-source instruction-tuned models. The prompt for direct probing is “What is the first page of [TITLE]?”.

Datasets : 1930-2010 best-sellers (Table below), LeetCode - coding challenge problems

Language Models

Results and Discussion

Do larger language models memorize more?

[Figure2 Left]

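There is a concern that larger language models may increasingly infringe existing copyrights in the future.
Models under 60B reproduce, on average, fewer than 50 words of memorized text with simple prompting strategies.
GPT-3.5 and Claude (both closed-source) show a serious degree of verbatim reproduction, which raises copyright concerns.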

What works are memorized the most?

[Figure2 Right]

Popularity indicators.

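In the GPT-3.5 LCS experiment, the number of reviews and the number of editions are positively correlated with LCS length: the larger they are, the longer the LCS.
On LeetCode, the lower the ranking, the more common the code is and the larger the LCS ratio.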
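The LCS measurements above can be illustrated with a small word-level example. This sketch assumes that LCS denotes the longest common subsequence computed over whitespace-separated words, which the post does not spell out; the function names are hypothetical, and the 50-word threshold is the baseline the authors set in the Copyright Laws section.

```python
def lcs_length(a_words, b_words):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b_words) + 1)
    for a in a_words:
        curr = [0]
        for j, b in enumerate(b_words, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def exceeds_baseline(original, generated, baseline_words=50):
    """True if the shared word sequence is longer than the baseline."""
    return lcs_length(original.split(), generated.split()) > baseline_words


shared = lcs_length(
    "it is a truth universally acknowledged".split(),
    "it is truth universally known".split(),
)
```

The dynamic program keeps only two rows, so memory grows with the length of the generated text rather than with the product of both lengths.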

Conclusion

Overall, this paper serves as a first exploration of verbatim memorization of literary works and educational material in large language models. It raises important questions around large language models and copyright laws. No legal conclusions should be drawn from our experiments, but we think we have provided methods and preliminary results that can help provide the empirical data to ground such discussions.

[ICLR2024] SELF-RAG: LEARNING TO RETRIEVE, GENERATE, AND CRITIQUE THROUGH SELF-REFLECTION

Mon, 19 Feb 2024 11:27:00 +0000

[pdf] [openreview] [github]

Akari Asai^†, Zeqiu Wu^†, Yizhong Wang^†§, Avirup Sil^‡, Hannaneh Hajishirzi^†§
^† University of Washington ^§ Allen Institute for AI ^‡ IBM Research AI

Abstract

(Hallucination and RAG) Despite the strong performance of LLMs, hallucination (factual inconsistency) still occurs, and LMs based on Retrieval-Augmented Generation (RAG) handle this issue well.
(Irrelevant retrieval problem) However, when retrieval itself is not necessary or the passages are not relevant, retrieval augmentation can be unhelpful for response generation.
( Self-RAG ) This paper therefore proposes Self-Reflective Retrieval-Augmented Generation (Self-RAG). A single LM (1) retrieves passages on demand, (2) generates with them, and (3) reflects on the retrieved passages and its own generation through reflection tokens.
(Controllability) At inference time, the reflection tokens can be adjusted to fit the requirements of different tasks.
(Experiment) In the experiments, the 7B and 13B models outperform state-of-the-art RALMs and surpass ChatGPT and retrieval-augmented LLaMa2-chat on open-domain QA. Factuality accuracy in long-form generation is also very high.

1. Introduction

▶ Retrieval-Augmented Generation (RAG)
Because LLMs still hallucinate despite their strong performance, RAG (or RALM), which attaches retrieval as in the left part of the figure above, is being studied widely. However, these methods can introduce unnecessary or off-topic passages and lead to low-quality generation. The main reason is that passages are retrieved regardless of whether factual grounding is helpful. In addition, the generation LM itself is not trained to use what is retrieved, so it is hard to know whether the output is consistent with the retrieved relevant passages.

▶ Self-Reflective Retrieval-augmented Generation (Self-RAG)
The authors therefore propose Self-Reflective Retrieval-augmented Generation (Self-RAG). It overcomes the loss of versatility described above through on-demand retrieval and self-reflection. Using special tokens called reflection tokens, the model learns both generation and reflection end to end. Reflection tokens are divided into retrieval tokens and critique tokens, which judge the need for retrieval and the generation quality, respectively.

In more detail, given the input prompt and the preceding generation:

(1) SELF-RAG first decides whether it would help to augment the continued generation with retrieved passages.
(2) If so, it outputs a retrieval token and calls the retriever on demand.
(3) It then evaluates the relevance of the multiple retrieved passages and generates the corresponding task output for each.
(4) Finally, the critique tokens criticize the outputs in terms of factuality and overall quality, and the best one is chosen.

SELF-RAG expands the model vocabulary and is trained to generate reflection tokens as part of next-token prediction. Inspired by reward models in RL, reflection tokens are inserted offline directly into the original corpus by a trained critic model. The critic model is trained on a dataset of inputs, outputs, and the corresponding reflection tokens collected by prompting a proprietary LM such as GPT-4. In addition, inspired by control tokens used in text generation, critique tokens are used to assess the model's own predictions and to decide the final generation output.

With these reflection tokens, SELF-RAG supports customizable decoding. For example, the retrieval frequency can be adjusted flexibly, and decoding can be steered toward user preferences using the reflection token probabilities.

▶ Experiments
In the experiments, SELF-RAG significantly outperforms pre-trained and instruction-tuned LLMs on six tasks, including reasoning and long-form generation. (It beats retrieval-augmented ChatGPT on four tasks, and LLaMa2-chat and Alpaca on all tasks.)

2. SELF-RAG: Learning to retrieve, generate and critique

As described above, introducing reflection tokens lets a single end-to-end LM generate, retrieve, and criticize.

2.1. Problem formalization and overview

Given an input $x$, the model generates $y=[y_1, …, y_T]$. Each $y_t$ may contain reflection tokens in addition to tokens from the original vocabulary. The types of reflection tokens are listed below.

(1) Inference overview

Given $x$ and $y_{<t}$, the model decodes a retrieval token that evaluates the utility of retrieval. There are two cases. If retrieval is judged necessary, the model generates the critique tokens IS_REL (evaluating the relevance of the retrieved passage), IS_SUP (whether the generated response is supported by the passage), and IS_USE, and then uses them to rank the multiple passages. If retrieval is judged unnecessary, the model predicts the next segment and criticizes it with the IS_USE token.
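The two-branch inference procedure can be sketched as plain control flow. This is a hedged sketch and not the released implementation: in SELF-RAG a single LM produces the retrieval decision, the segment, and the critique tokens, whereas here they are passed in as hypothetical callables (`needs_retrieval`, `retrieve`, `generate`, `critique`) so the flow can be read in isolation. Collapsing the critique tokens into one numeric score is also an assumption of this sketch.

```python
def self_rag_step(x, y_prev, needs_retrieval, retrieve, generate, critique):
    """Produce the next segment and its critique score.

    needs_retrieval(x, y_prev) -> bool           retrieval-token decision
    retrieve(x, y_prev)        -> list[str]      candidate passages
    generate(x, y_prev, p)     -> str            segment (p is None if no retrieval)
    critique(x, p, segment)    -> float          combined critique score
    """
    if needs_retrieval(x, y_prev):
        candidates = []
        for passage in retrieve(x, y_prev):
            segment = generate(x, y_prev, passage)
            # Stands in for IS_REL, IS_SUP, and IS_USE on this passage.
            score = critique(x, passage, segment)
            candidates.append((segment, score))
        # Rank the passage-conditioned segments and keep the best one.
        return max(candidates, key=lambda c: c[1])
    segment = generate(x, y_prev, None)
    # Without retrieval only the usefulness critique (IS_USE) applies.
    return segment, critique(x, None, segment)


best = self_rag_step(
    "question",
    [],
    needs_retrieval=lambda x, y: True,
    retrieve=lambda x, y: ["off-topic passage", "relevant passage"],
    generate=lambda x, y, p: f"answer using {p}",
    critique=lambda x, p, seg: 1.0 if p == "relevant passage" else 0.0,
)
```

Keeping the retrieval decision as the first branch is what makes retrieval on demand: when it returns false, the retriever is never called.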

(2) Training overview
After the reflection tokens are added to the vocabulary, the model is trained with ordinary next-token prediction. More precisely, the generator LM learns to produce both text and reflection tokens, where the reflection tokens are those predicted by the critic model. The critic model is used to insert reflection tokens into the training corpus, and the updated corpus is then used for training.

2.2. SELF-RAG Training

(1) TRAINING THE CRITIC MODEL

Data collection for critic model

Since manual annotation is expensive, GPT-4 is a natural choice for annotation. However, relying on such a commercial LM is costly and hurts reproducibility, so the authors prompt GPT-4 to create a set of reflection tokens and then distill them into an in-house critic model.

GPT-4 is prompted with instructions such as “Given an instruction, make a judgment on whether finding some external documents from the web helps to generate a better response.” to produce reflection tokens. After GPT-4 produces them, humans evaluate the outputs and those with high scores are selected, yielding 4k examples.

Critic learning

The critic model is then trained by distillation on the 4k examples.

Critic learning follows the objective below.

The critic uses Llama 2-7B, the same model as the generator LM.

(2) TRAINING THE GENERATOR MODEL

Data collection for generator

In an input-output pair $(x,y)$, each output consists of several segments $y_t$. For each $y_t$, the trained critic model assigns reflection tokens. For example, if Retrieve = Yes is assigned, the retriever fetches the top-k passages. For each passage, the critic model then generates IS_REL to decide whether the passage is relevant. If IS_REL = yes, IS_SUP is assigned to indicate whether the passage supports the model generation. Finally, the critic model appends the IS_USE token to the end of the generated segment $y_t$.

Generator Learning

The generator is trained on the augmented corpus above with the standard next-token prediction objective.

2.3. SELF-RAG INFERENCE

Retrieval is not always needed. In open-ended tasks such as essay writing, for example, retrieving less often can help creativity. SELF-RAG therefore decides adaptively, via the Retrieve token, whether or not to retrieve.

In addition, a tree-decoding method guided by critique tokens can be used at inference time (see the paper).

3. EXPERIMENTS

3.1. TASKS AND DATASETS

Closed-set tasks : (1) Fact verification dataset : PubHealth (public health dataset), (2) Multiple-choice reasoning dataset : ARC-Challenge (scientific exams)
Short-form generation tasks : PopQA, TriviaQA-unfiltered
Long-form generation tasks : (1) biography generation task, (2) long-form QA task : ALCE-ASQA

3.2. BASELINES

Baselines without retrieval : LLaMA2-7B, LLaMA2-13B, Alpaca-7B, Alpaca-13B, ChatGPT, LLaMA2-chat-13B, CoVE-65B
Baselines with retrieval : Retrieval-augmented LLaMA2, Retrieval-augmented ChatGPT

4. RESULTS and ANALYSIS

4.1. MAIN RESULTS

Comparison against baselines without retrieval

[Table2 Top]

SELF-RAG (bottom two rows) demonstrates a substantial performance advantage over supervised fine-tuned LLMs in all tasks and even outperforms ChatGPT in PubHealth, PopQA, biography generations, and ASQA (Rouge and MAUVE)
SELF-RAG also significantly outperforms a concurrent method that employs sophisticated prompt engineering

Comparison against baselines with retrieval

[Table2 Bottom]

SELF-RAG also outperforms existing RAG in many tasks, obtaining the best performance among non-proprietary LM-based models on all tasks

4.2. ANALYSIS

Ablation studies

[Figure3 (a)]

Effects of inference-time customization

[Figure3 (b)]

Efficiency and accuracy trade-off

[Figure3 (c)]

Effects of training data size

[Figure4 (a)]

Human evaluations

[Figure4 (b)]

Conclusion

This work introduces SELF-RAG, a new framework to enhance the quality and factuality of LLMs through retrieval on demand and self-reflection. SELF-RAG trains an LM to learn to retrieve, generate, and critique text passages and its own generation by predicting the next tokens from its original vocabulary as well as newly added special tokens, called reflection tokens. SELF-RAG further enables the tailoring of LM behaviors at test time by leveraging reflection tokens. Our holistic evaluations on six tasks using multiple metrics demonstrate that SELF-RAG significantly outperforms LLMs with more parameters or with conventional retrieval-augmented generation approaches.

[EMNLP 2023] Poisoning Retrieval Corpora by Injecting Adversarial Passages

Fri, 16 Feb 2024 13:35:00 +0000

[pdf] [github]

Zexuan Zhong^†∗, Ziqing Huang^‡∗, Alexander Wettig^†, Danqi Chen^†
^†Princeton University ^‡ Tsinghua University

Abstract

(Adversarial Attacks on Retrieval Systems) This paper introduces an adversarial attack on dense retrieval: discrete tokens are perturbed to craft adversarial passages that maximize similarity with a training set of queries, and these passages are inserted into the corpus. Once injected, such passages are highly effective at fooling the retrieval system.
(Generalization) The attack even generalizes out of domain: adversarial passages optimized on Natural Questions reach an attack success rate above 94% on financial-domain documents and online forums.
(Experiment) Benchmarking the attack against a range of dense retrievers shows that most are vulnerable with around 500 injected passages, a tiny amount compared with corpora that typically contain millions of passages.

1. Introduction

▶ To what extent can retrievers be safely deployed?
Dense retrievers have become widely used and substantially outperform traditional lexical methods. However, they remain weak on long-tail entities and generalize poorly out of domain, which raises the question of how far they can be trusted in real-world scenarios.

▶ Corpus Poisoning Attack
This paper exposes a new type of vulnerability: the corpus poisoning attack, which fools the system by injecting a small fraction of adversarial passages into the corpus. Whereas prior work showed that adversarial passages can be crafted for individual queries, here the passages are generated from a broad set of user queries and even generalize out of domain. This setting is realistic for sources such as Wikipedia or online forums like Reddit, and it could become a new tool for black hat SEO (Search Engine Optimization).

The attack is a gradient-based method inspired by HotFlip: it iteratively perturbs discrete tokens to maximize similarity with a training set of queries. It also uses a simple clustering technique.

▶ Experiment
Applied to a range of state-of-the-art dense retrievers, the attack fools the system with only a handful of adversarial passages. The unsupervised Contriever model is especially vulnerable: just 10 adversarial passages are enough to fool 90% of queries. Supervised retrievers such as DPR and ANCE are somewhat harder to attack, but around 500 passages still yield an attack success rate of 50%. The attack is also effective against multi-vector retrievers such as ColBERT, which are less sensitive to single-token changes. Finally, the method generalizes well to out-of-domain retrieval corpora.

2. Method

2.1. Problem Definition

A dense retriever is a dual encoder: with a passage encoder $E_p$ and a query encoder $E_q$, it uses the inner product as the embedding similarity: $sim(q,p) = E_q(q)^{T} E_p(p)$.

Whether supervised (DPR, ANCE) or unsupervised (Contriever), these dual encoders are trained with a contrastive learning objective. At inference time, retrieval is performed with nearest-neighbor search.
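As a concrete illustration of the similarity function and the inference step, here is a minimal sketch that scores precomputed embeddings by inner product and returns the top-k passages. The toy 2-D vectors stand in for real encoder outputs, and exhaustive sorting stands in for an actual nearest-neighbor index.

```python
from typing import List, Sequence


def sim(q_emb: Sequence[float], p_emb: Sequence[float]) -> float:
    """Inner product E_q(q)^T E_p(p) between two precomputed embeddings."""
    return sum(a * b for a, b in zip(q_emb, p_emb))


def top_k(q_emb: Sequence[float],
          passage_embs: List[Sequence[float]], k: int) -> List[int]:
    """Indices of the k passages most similar to the query (exhaustive search)."""
    order = sorted(range(len(passage_embs)),
                   key=lambda i: sim(q_emb, passage_embs[i]),
                   reverse=True)
    return order[:k]


query = [1.0, 0.0]
passages = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
print(top_k(query, passages, k=2))  # [1, 2]
```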

2.2 Corpus Poisoning Attack

Once the corpus has been poisoned, the dense retriever is expected to retrieve the adversarial passages. For a set of adversarial passages $A = \{a_1, a_2, …, a_{|A|}\}$ with $|A| \ll |C|$, where $C$ is the corpus, the objective is for at least one adversarial passage to appear in the top-k retrieved results.

2.3. Optimization

Given a query set $Q$, the goal is to obtain adversarial passages $A$ that appear in the retrieval results of as many queries as possible. To mislead the model, the attack searches for a token sequence $a=[t_1, t_2, …]$ that maximizes the similarity to the queries in $Q$.

We first look at how a single passage is generated before moving on to multiple passages. Inspired by HotFlip, the authors propose a gradient-based method to solve this optimization problem. The method replaces tokens one at a time while approximating the resulting change in the model output. The adversarial passage is initialized from a random passage. At each step, the method approximates the model output that would result from replacing a token $t_i$ with another token $t_i'$. As in HotFlip, this approximation is computed from the gradient.

Given the query set $Q$, the goal at each step is therefore to find the best replacement token $t_i'$ among the candidates in the vocabulary $V$.
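The formulas for this section are not reproduced in the post. Based on the definitions above, they can be reconstructed roughly as follows. This is a sketch under assumptions: $e_t$ denotes the input embedding of token $t$ (a symbol not defined in this post), the objective is taken to be the average similarity over $Q$, and the exact notation in the paper may differ.

$$
a = \arg\max_{a'} \frac{1}{|Q|} \sum_{q \in Q} E_q(q)^{T} E_p(a')
$$

$$
t_i^{*} = \arg\max_{t_i' \in V} \frac{1}{|Q|} \sum_{q \in Q} e_{t_i'}^{T} \, \nabla_{e_{t_i}} sim(q, a)
$$

The first line is the overall objective for one adversarial passage; the second is the first-order (HotFlip-style) estimate used to pick the replacement for position $i$.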

2.4. Generating multiple adversarial passages

The single-passage procedure above extends to multiple passages as follows. The query embeddings $E_q(q_i)$ are grouped with k-means clustering, and the single-passage attack is then run once per cluster, yielding one adversarial passage for each group of similar queries.
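The clustering step can be sketched with a small pure-Python k-means. Two things are assumptions here: the farthest-point initialization (swapped in so the example is deterministic) and the toy 2-D query embeddings. The per-cluster attack itself is not implemented; each group of query indices would simply be handed to the single-passage optimization.

```python
from typing import Dict, List

Vector = List[float]


def dist2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iters: int = 20) -> List[int]:
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(list(farthest))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        assign = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


query_embs = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
assign = kmeans(query_embs, k=2)

# One adversarial passage would be optimized per cluster of queries.
clusters: Dict[int, List[int]] = {}
for idx, c in enumerate(assign):
    clusters.setdefault(c, []).append(idx)
print(clusters)
```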

3. Experiments

3.1. Setup

Retrieval datasets : Natural Questions (NQ), MS MARCO
Eval sets : unseen BEIR datasets (e.g., Quora, scientific, financial documents)
Dense Retriever : Contriever, Contriever-ms, DPR-nq, DPR-mul, ANCE, ColBERT
Evaluation Metrics : top-k attack success rate
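The metric can be made concrete with a few lines of code. This sketch assumes the definition implied by the attack objective: a query counts as successfully attacked when at least one adversarial passage appears in its top-k results. The passage IDs are made up.

```python
from typing import List, Set


def attack_success_rate(topk_results: List[List[str]],
                        adversarial_ids: Set[str]) -> float:
    """Fraction of queries with at least one adversarial passage in the top-k."""
    hits = sum(
        1 for ids in topk_results
        if any(pid in adversarial_ids for pid in ids)
    )
    return hits / len(topk_results)


# Top-3 retrieval results for four queries (hypothetical IDs).
results = [
    ["p1", "adv0", "p7"],
    ["p2", "p3", "p4"],
    ["adv1", "p9", "p5"],
    ["p6", "p8", "p0"],
]
print(attack_success_rate(results, {"adv0", "adv1"}))  # 0.5
```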

3.2. Attacks on In-domain Queries

Contriever is highly vulnerable: a single additional adversarial passage fools 75% of queries.
In contrast, the supervised retrievers DPR and ANCE are more robust.
Still, as the right-hand plot shows, around 500 passages are enough to attack 50% of queries for DPR and ANCE as well.

3.3. Attacks Transfer Out-of-Domain

Adversarial passages optimized on the two training sets, NQ and MS MARCO, achieve a high attack success rate on nearly all target domains.

3.4. Attacks on Multi-Vector Retriever

The attack is also effective against multi-vector retrievers such as ColBERT, which are less sensitive to single-token changes.

Conclusion

We proposed a new attack for dense retrievers, in which adversarial passages are inserted into the corpus to mislead their retrieval outputs. We show that even a small number of adversarial passages can successfully attack state-of-the-art dense retrievers and generalize to queries from unseen domains. These findings have important implications for the future deployment of robust retrieval systems in real-world applications.

[Arxiv 2307] Evaluating the Ripple Effects of Knowledge Editing in Language Models

Wed, 14 Feb 2024 07:35:00 +0000

[pdf] [github]

Roi Cohen¹, Eden Biran¹, Ori Yoran¹, Amir Globerson^1,2, Mor Geva^1,2
¹ Blavatnik School of Computer Science, Tel Aviv University ² Google Research

Abstract

(Obsolete knowledge) Recent LMs capture factual knowledge well, but when that knowledge becomes obsolete they generate incorrect outputs.
(Existing Editing Evaluation) Existing evaluations of knowledge editing check whether the individual fact was injected successfully and, at the same time, whether predictions for other subjects remain unchanged.
(Ripple Effect) This paper defines and studies the “ripple effect”: injecting one fact entails updates to other facts.
(RippleEdits) The authors define evaluation criteria for the ripple effect and build RippleEdits, a benchmark of 5k factual edits that follows them. Experiments show that several models fail to handle ripple effects well, while a simple in-context editing baseline obtains good editing performance.

1. Introduction

▶ Existing Knowledge Editing (KE) method

Current LMs capture factual knowledge well, but when that knowledge becomes outdated they generate factually incorrect outputs. Many studies have therefore tried to fix such factual errors with knowledge editing (KE) methods. Existing KE methods typically edit knowledge by overriding an entity-relation-object triplet $(e,r,o)$ (usually the object $o$ is overwritten for a given $e$ and $r$).

The key point in these KE methods is the “sanity check” for editing success. It is usually done by querying $(e,r,?)$ and checking whether the model returns the outdated $o$ or the updated one. In addition, other facts should not be distorted, so evaluations of that are included as well ([1],[2],[3]).

▶ Ripple Effects
When a knowledge edit occurs, some associated facts must change with it (in the example above, when Messi transfers, Messi must be added to the new team's roster), while other facts must stay the same (Messi's nationality is still Argentine after he changes teams). The authors define this propagation, where a change to one fact has knock-on consequences for other facts, as the Ripple Effect. To pin it down, they propose six concrete evaluation criteria.

▶ RippleEdits Benchmark
Based on these criteria, the authors build a benchmark called RippleEdits. It consists of 5k entries and evaluates whether an edit succeeds once ripple effects are taken into account. Each entry also carries a timestamp as metadata.

Using RippleEdits, the authors apply three knowledge editing methods to five LMs and find poor performance on most evaluation criteria. They additionally observe that (1) larger models handle ripple effects more easily, and (2) editing frequent entities produces more logical reasoning errors.

Finally, they propose a simple in-context editing method, based on the causal attention mechanism, that outperforms existing parametric KE methods.

2. Problem Setting

Factual knowledge is represented as triples $(e,r,o)$, for which two edit types are defined. (1) Modification changes outdated knowledge $(e,r,o)$ that the model already holds into $(e,r,o^*)$, and (2) injection adds new knowledge $(e,r,o^*)$.

For a one-to-one relation (e.g. Date of Birth), injection can be seen as editing an empty object, from $(e,r,∅)$ to $(e,r,o^*)$. For one-to-many relations such as Sibling or Occupation, an injection edit augments the object set: $(e, r, \{o_1, …, o_n\}) \to (e, r, \{o_1, …, o_n, o^*\})$.
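To make the two edit types concrete, here is a minimal sketch on a toy knowledge graph stored as a dictionary from $(e, r)$ to a set of objects. The data structure, the function names, and the example facts ("Team A", "Team B", the sibling names) are all hypothetical and only serve the illustration.

```python
from typing import Dict, Set, Tuple

KG = Dict[Tuple[str, str], Set[str]]


def modify(kg: KG, e: str, r: str, new_o: str) -> None:
    """Modification: (e, r, o) -> (e, r, o*), overwriting the old object."""
    if not kg.get((e, r)):
        raise ValueError("modification needs an existing fact")
    kg[(e, r)] = {new_o}


def inject(kg: KG, e: str, r: str, new_o: str, one_to_many: bool) -> None:
    """Injection: fill an empty slot, or augment a one-to-many relation."""
    current = kg.setdefault((e, r), set())
    if not one_to_many and current:
        raise ValueError("one-to-one slot is already filled")
    current.add(new_o)


kg: KG = {
    ("Messi", "team"): {"Team A"},
    ("Messi", "nationality"): {"Argentina"},
    ("Messi", "sibling"): {"Sibling 1"},
}
modify(kg, "Messi", "team", "Team B")                         # (e, r, o) -> (e, r, o*)
inject(kg, "Messi", "sibling", "Sibling 2", one_to_many=True)  # set grows by o*
inject(kg, "Messi", "date of birth", "some date", one_to_many=False)
print(kg)
```

Note that the untouched `nationality` entry is exactly the kind of fact that should stay unchanged after the edit.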

3. Ripple Effects of Factual Edits

For the full knowledge graph $K$, the ripple effect of an edit δ : $(e,r,o) \to (e,r,o')$ is the set of changes it brings about in $K$, denoted $R(δ)$. Its size $|R(δ)|$ measures how far a single edit ripples through the whole knowledge graph and is defined as the severity of the edit.

3.1. Evaluation Criteria

Ripple effects within 2 hops are grouped into the following six evaluation criteria.

See the paper for details on each criterion.

4. The RIPPLEEDITS Benchmark

4.1. Data Generation Pipeline

Step 1: Factual triplets collection
The first step collects facts. Three types of triplets are extracted from WIKIDATA.

RECENT : recent facts created after July 2022, used to build injection edits
RANDOM : randomly sampled facts that can later serve as modification edits
POPULAR : triplets about popular entities, for cases where severity is large

Step 2: Edits generation

Edits are then generated: RECENT facts serve as injection edits, while the older facts in RANDOM and POPULAR are turned into modification edits.

Step 3: Evaluation tests generation

This process is repeated for new queries to generate the test set.

Step 4: Phrasing in natural language

The knowledge-graph triplets are then phrased as natural-language sentences.

4.2. Data Statistics

5. Experiments

5.1. Evaluation Setting

Editing Method : MEND, ROME, MEMIT
Baseline : In-context Editing (ICE)
Models : GPT-2 XL, GPT-J, LLaMA, GPT-NeoX, GPT-3

5.2. Results

The results mostly consist of baseline experiments.

(1) RECENT

(2) RANDOM

(3) POPULAR

(4) Avg. of ROME

(5) Accuracy of MEND, ROME, MEMIT

Conclusion and Discussion

We introduce the notion of ripple effects in knowledge editing, suggesting that editing a particular fact implies further updates of related facts. We additionally propose evaluation criteria for ripple effects and create RIPPLEEDITS, a diagnostic benchmark designed to evaluate how well KE methods handle the ripple effects of various edits. We evaluate prominent KE methods and show that they often fail to introduce consistent edits that capture the ripple effects of an edit, suggesting that future development of KE methods should consider those effects more carefully. Last, we show that a simple in-context editing method achieves the best results on RIPPLEEDITS, highlighting the potential of such editing approaches.

Notably, our benchmark covers a small fraction of all possible ripple-edits. For example, one could consider ripple effects that involve more than two hops, and explore the graph structure of different edits. In addition, while we focus on ripple effects of single edits, future work can consider the effect of editing multiple facts in a single batch. Finally, it would be interesting to consider cases where models succeed in capturing ripple-edits, and analyze how these are implemented mechanistically in the transformer architecture.

[ICML2022] HyperPrompt: Prompt-based Task-Conditioning of Transformers

Mon, 05 Feb 2024 00:08:00 +0000

[pdf]

Yun He ^1*, Huaixiu Steven Zheng ^1*, Yi Tay ², Jai Gupta², Yu Du ², Vamsi Aribandi ², Zhe Zhao ², YaGuang Li², Zhao Chen³, Donald Metzler², Heng-Tze Cheng², Ed H. Chi²
¹ Texas A&M University ² Google Research ³ Waymo LLC.

Abstract

(HyperPrompt) This paper proposes HyperPrompt, a prompt-based task-conditioning architecture for self-attention in Transformers.
(Global memory) The hyper-prompts, generated by a HyperNetwork, are shown to enable information sharing across tasks and also to act as a global memory for each task.
(Efficiency) With only 0.14% additional parameters, it is competitive with multi-task learning baselines such as T5.

1. Introduction

▶HyperPrompt
Prompt tuning, which conditions an LLM on soft learnable memory tokens, has been attracting attention. Its advantage is fast, lightweight training while the pretrained model stays frozen.

This paper proposes HyperPrompt, a new prompt-tuning method for multi-task learning. HyperPrompt introduces task-conditioned hyper-prompts so that the model can be conditioned on task-specific information through the prompts.

▶HyperNetwork
To produce these hyper-prompts, the authors introduce a HyperNetwork that generates task-aware and layer-aware prompts. In existing multi-task learning methods the parameter count usually grows linearly with the number of tasks, whereas with a HyperNetwork a very small number of additional parameters is enough to be competitive with them, which makes the approach efficient. The authors also claim to be the first to use a HyperNetwork as a prompt generator.

▶Training whole network including LM

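Unlike existing prompt-tuning or adapter approaches, the authors argue that it is important to train the whole network, including the LM. Their reasons are that (1) existing prompt tuning only works well for LLMs with 11B or more parameters, and (2) training only the adaptive parameters brings no real benefit at inference time. They therefore conclude that it is better to train the whole network for higher performance.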

2. Methods

HyperPrompt comes in three variants: HyperPrompt-Share, HyperPrompt-Sep, and HyperPrompt-Global.

The two core ideas are (1) injecting task conditioning into self-attention, and (2) using a HyperNetwork to generate the prompts.

2.1. Prompt-based Task-Conditioned Transformer

Existing adapter-based methods insert an adapter (a dense-ReLU-dense network) right after the FFN of each Transformer block. HyperPrompt instead prepends task-conditioned trainable vectors to the keys and values at every layer. Prepending learnable prompts to the network has been studied before, but the authors claim to be the first to apply the idea to multi-task learning.

To this end the authors propose HyperPrompt. As panel (a) of the figure above shows, the hyper-prompts are prepended to the keys and values, and self-attention then proceeds as in a standard Transformer. The benefit is that the hyper-prompts take part in forming the attention feature map, so they can act as task-specific memory.

2.2. HyperPrompt

How should the hyper-prompt for the m-th layer be generated? A naive option is to create T (# of tasks) prompts per layer and initialize them randomly, but this costs O(T × M) (M: # of layers) and is inefficient.

Instead, the authors first create one global prompt per task and then project it into each layer block to obtain M layer-specific prompts.

(1) Global Prompts
First, T global prompts are initialized, one per task.

(2) Local HyperNetworks
In each Transformer layer block, two local HyperNetworks take the global prompt as input and generate the local key prompt and the local value prompt. As panel (b) of the figure above shows, each HyperNetwork uses a bottleneck architecture with a down-projection.

(3-1) HyperPrompt-Share
In this setting, the HyperNetworks that generate the local key and value prompts are shared across all tasks rather than being task-specific. This saves many parameters (by a factor of 1/T), but the experiments show that it reduces model capacity.

(3-2) HyperPrompt-Sep
The opposite setting, HyperPrompt-Sep, gives each task its own local HyperNetworks and is reported to perform better.
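The mechanism of Sections 2.1 and 2.2 can be sketched in pure Python for a single head and a single layer: a global prompt goes through two bottleneck HyperNetworks (down-projection, ReLU, up-projection) to give key and value prompts, which are prepended before ordinary scaled dot-product attention. All shapes and weight values are toy numbers chosen for the example, and the function names are not from the paper.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m: Matrix) -> Matrix:
    return [[max(0.0, v) for v in row] for row in m]


def hypernetwork(global_prompt: Matrix, w_down: Matrix, w_up: Matrix) -> Matrix:
    """Bottleneck projection: down-project, ReLU, up-project."""
    return matmul(relu(matmul(global_prompt, w_down)), w_up)


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention with a numerically stable softmax."""
    d = len(q[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d) for k_row in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out


def hyperprompt_attention(q: Matrix, k: Matrix, v: Matrix, global_prompt: Matrix,
                          key_net: Tuple[Matrix, Matrix],
                          value_net: Tuple[Matrix, Matrix]) -> Matrix:
    key_prompt = hypernetwork(global_prompt, *key_net)
    value_prompt = hypernetwork(global_prompt, *value_net)
    # Prepend the prompts to keys and values; queries are left untouched.
    return attention(q, key_prompt + k, value_prompt + v)


global_prompt = [[1.0, 2.0]]                 # prompt length 1, model dim 2
net = ([[0.5], [0.5]], [[1.0, -1.0]])        # (w_down 2x1, w_up 1x2), bottleneck dim 1
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = hyperprompt_attention(q, k, v, global_prompt, net, net)
print(out)
```

The output keeps the sequence length of the queries: the prompts only add extra positions to attend to, which is why they behave like a task-specific memory.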

2.3. HyperPrompt-Global

To generate these task-specific and layer-specific HyperNetworks efficiently, HyperPrompt-Global introduces a global HyperNetwork, as in panel (c) of the figure. It takes a layer-aware task embedding as input and generates the HyperNetwork for each layer.

3. Experiments

3.1. Experimental Setup

Dataset : GLUE, SUPERGLUE
Transformers : T5-Base to T5-XXL
Baselines : vanilla T5, vanilla Adapter, HyperFormer++ (adapter-based MTL model), Prompt-Tuning

3.2. Key Results

(1) Prompt-Tuning only works well with the 11B model.

(2) HyperPrompt performs well across all model sizes.

3.3. Tuning all vs Task-Conditioned Params

Prior work reported that tuning only the prompts is better than tuning the whole LM, but that result was measured only on the GLUE benchmark with small models, T5 Base and T5 Small.

This experiment compares training the full model against training only the task-conditioned parameters.

As the results show, with HyperPrompt it is much better to tune the full model.

3.4. Computational Efficiency

Because HyperPrompt adds no extra feed-forward modules and is instead blended into self-attention, it requires fewer #Ops. Training time is efficient as well.

3.5. Ablation Study

The upper table reports results on T5 Base and the lower table on T5 Large.

(1) HyperPrompt-Global vs Prompt-Tuning.
Prompt-Tuning is a single-task fine-tuning procedure that does not tune the whole LM, so for a fair comparison the authors add per-task prompts and tune the whole LM. HyperPrompt-Global performs better on both GLUE and SUPERGLUE.

(2) HyperPrompt-Global vs HyperFormer++.
HyperPrompt-Global also outperforms HyperFormer++, an adapter-based method.

(3) HyperPrompt-Global vs MTL.
Compared with a model trained on all tasks through multi-task learning, HyperPrompt-Global improves performance with very few additional parameters (1.02×).

(4) HyperPrompt-Global vs HyperPrompt-Share/Sep.
Surprisingly, HyperPrompt-Share outperforms HyperPrompt-Sep on SUPERGLUE. HyperPrompt-Global, which uses a global HyperNetwork to generate the projection networks, performs best in every case.

4. Conclusion

We propose a novel architecture for prompt-based task-conditioning of self-attention in Transformers. The hyperprompts are generated by a HyperNetwork to enable flexible information sharing among tasks while remain efficient in parameters and computation. HyperPrompt allows the network to learn task-specific feature maps where the hyper-prompts serve as task global memories, encouraging a more diverse distribution of attention. Extensive experiments show that HyperPrompt can achieve superior performances over strong T5 multi-task learning baselines and parameter-efficient models including Prompt-Tuning and HyperFormer++ on GLUE and SuperGLUE benchmarks.

[Arxiv 2312] NoMIRACL: Knowing When You Don’t Know for Robust Multilingual Retrieval-Augmented Generation

Fri, 02 Feb 2024 03:00:00 +0000

[pdf] [github] [huggingface]

Nandan Thakur¹, Luiz Bonifacio^1,3, Xinyu Zhang¹, Odunayo Ogundepo¹, Ehsan Kamalloo¹, David Alfonso-Hermelo², Xiaoguang Li², Qun Liu², Boxing Chen², Mehdi Rezagholizadeh², Jimmy Lin¹
¹ David R. Cheriton School of Computer Science, University of Waterloo, Canada ² Huawei Noah’s Ark Lab ³ FEEC-Unicamp, Brazil

Abstract

(Lack of Evaluation of Multilingual LLM Robustness) RAG plays a major role in reducing factual hallucination by letting an LLM leverage external knowledge, but evaluation of robustness to errors in the retrieved external knowledge is still lacking, especially for languages other than English.
(NoMIRACL) The authors propose NoMIRACL, a benchmark that measures LLM robustness in RAG across 18 languages. It deliberately includes queries whose retrieved passages were all judged non-relevant by human annotators.
(LLM Robustness) Robustness is measured along two axes: (1) hallucination rate and (2) error rate.
(Experiment) GPT-4 is found to hallucinate more often in higher-resource languages such as English and French. The authors argue that the benchmark can serve as a foundation for future work on how to reject non-relevant information.

1. Introduction

▶ Challenging issue in RAG
Retrieval-Augmented Generation (RAG) injects information from reliable knowledge corpora into an LLM. However, compared with the strong LLM-based generation stage, the retrieval system sometimes struggles to fetch relevant information. This weakness is more pronounced in zero-shot domains and low-resource languages. Incorrect or non-relevant information then leads the LLM to hallucinate. Yet no prior work has measured LLM reasoning capability in multilingual settings, including low-resource languages.

▶ NoMIRACL
This paper introduces NoMIRACL, a benchmark that measures how robust an LLM is to errors in the first-stage external information. The dataset was built with 30 hired native speakers. NoMIRACL consists of two subsets, non-relevant and relevant.

With GPT-4 as the baseline, the authors observe a 33.2% hallucination rate on the non-relevant subset and a comparatively low 14.9% error rate on the relevant subset. This underlines the importance of the retrieval stage in RAG.

In addition, the GPT-4 hallucination rate correlates positively with language resource size: more hallucination is observed in higher-resource languages (English, French, etc.).

2. Background and Problem Identification

2.1. Retrieval-Augmented Generation (RAG)

RAG is a core technique in recent work on factual correctness. As brief background: in the first stage, a retriever fetches the top-k passages for a given query. In the second stage, the LLM generates the output from the query and the retrieved top-k passages.

2.2. Robustness Evaluation

LLM robustness is measured with the simple contingency table above. The model should answer “I don’t know” when all passages are non-relevant and “Yes, Answer is Present” when a relevant passage is available.

3. NoMIRACL Dataset

Data Construction Procedure

NoMIRACL is built on top of the MIRACL dataset. First, annotators (native speakers) are asked to write well-formed queries for a prompt text. Each prompt is a snippet of the first 100 words of a language-specific Wikipedia article.

Next, a hybrid multilingual retrieval system retrieves the top-k passages, and the annotators label each one as relevant or non-relevant. A query whose top-k passages are all labeled with relevance 0 goes into the non-relevant subset.

4. Experimental Setup

Retriever system : BM25 + mDPR + mColBERT (mBERT trained with MS MARCO)

The LLM is shown the retrieved passages and asked to choose between “I don’t know” and “Yes, answer is present”. On the non-relevant subset this gives the hallucination rate = FP/(FP+TN), and on the relevant subset the error rate = FN/(FN+TP). The prompt is shown below.
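The two metrics can be computed as follows. The sketch assumes the usual reading of the contingency table: on the non-relevant subset, answering “Yes, answer is present” is a false positive and “I don't know” a true negative, so the hallucination rate is FP/(FP+TN); on the relevant subset, “I don't know” is a false negative and “Yes, answer is present” a true positive, so the error rate is FN/(FN+TP). The prediction lists are made up.

```python
from typing import List

YES = "Yes, answer is present"
IDK = "I don't know"


def hallucination_rate(preds_on_non_relevant: List[str]) -> float:
    """FP / (FP + TN), computed on the non-relevant subset only."""
    fp = preds_on_non_relevant.count(YES)
    tn = preds_on_non_relevant.count(IDK)
    return fp / (fp + tn)


def error_rate(preds_on_relevant: List[str]) -> float:
    """FN / (FN + TP), computed on the relevant subset only."""
    fn = preds_on_relevant.count(IDK)
    tp = preds_on_relevant.count(YES)
    return fn / (fn + tp)


non_relevant_preds = [YES, IDK, IDK]
relevant_preds = [YES, YES, YES, IDK]
print(hallucination_rate(non_relevant_preds))
print(error_rate(relevant_preds))  # 0.25
```

Each metric is computed within a single subset, so the two rates are independent of how many queries each subset contains.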

5. Experimental Results

GPT-4 hallucination rate on the non-relevant subset

(1) GPT-4 shows a 33.2% hallucination rate, so recognizing that all passages are non-relevant is not easy. (2) The hallucination rate is higher for languages with large corpora such as French and Spanish, a positive correlation with corpus size (Spearman 0.39).

GPT-4 error rate on the relevant subset

The error rate is low, about 14.9%, showing that the LLM can identify relevant passages well when retrieval succeeds.

Conclusion

Retrieval-augmented generation setups are effective in the factual grounding of LLM-generated output with information available in external knowledge bases. However, in cases where retrieved information does not assist in providing the answer, a robust LLM should potentially identify them. In our work, we provide NoMIRACL for evaluating LLM robustness as a binary classification task on 18 languages. We provide two subsets in NoMIRACL, the non-relevant subset, where queries contain all non-relevant passages, and the relevant subset, where queries contain at least a single relevant passage. We evaluate robustness using two metrics: hallucination rate and error rate.

In our work, we build a GPT-4 baseline on NoMIRACL. GPT-4 achieves a high 33.2% hallucination rate on the non-relevant subset and 14.9% error rate on the relevant NoMIRACL split, highlighting that GPT-4 finds it challenging to dismiss non-relevant passages over relevant passages in first-stage retrieved information. We open-source NoMIRACL and hope it encourages future work to improve LLM robustness.

[EMNLP2023] EXPLORE-INSTRUCT: Enhancing Domain-Specific Instruction Coverage through Active Exploration

Wed, 31 Jan 2024 08:47:00 +0000

[pdf] [github]

Fanqi Wan^1*, Xinting Huang^2†, Tao Yang¹, Xiaojun Quan^1†, Wei Bi², Shuming Shi²
¹ School of Computer Science and Engineering, Sun Yat-sen University, China ² Tencent AI Lab

Abstract

(Diversified Instruction Tuning) Instruction tuning can learn diversity from a broader spectrum of tasks.
(Lack of Data) However, existing data cannot cover individual domains.
(EXPLORE-INSTRUCT) This paper proposes a method that improves the data coverage for domain-specific instruction tuning through exploration by an LLM. Starting from a few representative domains, it generates diversified and domain-focused instruction-tuning data.
(Experiment) An analysis of the generated data confirms that it covers a wide range, and models trained on it show domain-specific improvements over the baselines.

1. Introduction

▶ Limited diversity of existing instruction-tuning data
Large Language Models (LLMs) can solve problems in many areas by following instructions. A key challenge is therefore constructing diverse instruction-tuning data. Widely used human-curated instruction-tuning datasets include FLAN and SUPER-NATURALINSTRUCTION. More recently, the SELF-INSTRUCT paper proposed a method to amplify the diversity of instruction-tuning data.

In contrast to such general instruction-tuning data, domain-specific instruction-tuning data can be defined naturally. Both human-curated and machine-generated data are actively studied, but neither covers a wide range. The reason is that most of them over-rely on human curation and are biased toward popular NLP tasks. As the figure above shows, Human-Curated (pink) and SELF-INSTRUCT (sky blue) data do not cover a wide range.

▶ EXPLORE-INSTRUCT
To secure diversity in instruction-tuning data creation, the authors propose EXPLORE-INSTRUCT. They first posit that a domain space inherently has a tree structure. Combining classical search algorithms with the power of LLMs, EXPLORE-INSTRUCT traverses the domain space and generates instruction-tuning data.

The method uses two strategies: (1) lookahead and (2) backtracking. Lookahead thoroughly explores fine-grained sub-tasks, while backtracking looks for alternative branches to widen the domain spectrum. The authors show that a simple Depth-First Search (DFS) is enough to explore a wide domain space.

▶ Validation of EXPLORE-INSTRUCT
To evaluate EXPLORE-INSTRUCT, the authors choose three disparate domains as testbeds: rewriting, brainstorming, and math. Each has unique use cases and requires a different skill set. Applying EXPLORE-INSTRUCT yields remarkably broad coverage in each domain. In the process, the LLM proves able not only to explore a wide range of the domain but also to understand each task deeply enough to decompose it into fine-grained sub-tasks. Models fine-tuned on the generated instruction-tuning data outperform the baselines.

2. Method

2.1. Domain Space Representation

The authors characterize the coverage of domain-specific instructions along two dimensions.

breadth : the different task categories within a domain
depth : fine-grained task decomposition

Understanding breadth means being able to handle the various categories within a domain, and understanding depth means being able to solve problems precisely through task decomposition.

The authors therefore model the domain as a tree structure $T$ made up of task nodes $V$ and edges $E$ that connect tasks to their sub-tasks.

2.2. Active Exploration Strategy

The EXPLORE-INSTRUCT algorithm consists of two kinds of exploration: (1) lookahead and (2) backtracking.

(1) Lookahead Exploration
Lookahead explores in depth, that is, it maps out fine-grained sub-tasks. For a task $V$, lookahead exploration uses the LLM to generate M sub-tasks. The lookahead prompt is shown below.

(2) Backtracking Exploration
Backtracking widens the breadth. For a given task node $V$, it finds the parent node $PV$ and then uses the LLM to generate M new sub-tasks of that parent. The prompt is similar to the lookahead prompt above.

2.3. EXPLORE-INSTRUCT Implementation

EXPLORE-INSTRUCT consists of two processes.

(1) Domain Exploration Strategy
The lookahead and backtracking steps above expand the breadth and depth of a single root task. The tree is expanded repeatedly until the stopping criteria, breadth B and depth K, are met.
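The exploration loop can be sketched as a depth-first expansion of a task tree. This is a simplified reading of the description above, not the authors' implementation: `toy_llm` is a hypothetical stand-in for the LLM call, and the exact order in which lookahead and backtracking are interleaved is an assumption.

```python
from typing import Callable, List, Optional

Proposer = Callable[[str, List[str], int], List[str]]


class TaskNode:
    def __init__(self, name: str, parent: Optional["TaskNode"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: List["TaskNode"] = []

    def depth(self) -> int:
        return 0 if self.parent is None else self.parent.depth() + 1


def add_subtasks(node: TaskNode, propose: Proposer, m: int) -> None:
    existing = [c.name for c in node.children]
    for name in propose(node.name, existing, m):
        node.children.append(TaskNode(name, parent=node))


def lookahead(node: TaskNode, propose: Proposer, m: int) -> None:
    """Go deeper: ask for m sub-tasks of the current task."""
    add_subtasks(node, propose, m)


def backtrack(node: TaskNode, propose: Proposer, m: int) -> None:
    """Go wider: return to the parent and ask for m new sibling sub-tasks."""
    if node.parent is not None:
        add_subtasks(node.parent, propose, m)


def explore(node: TaskNode, propose: Proposer, m: int,
            max_depth: int, max_breadth: int) -> None:
    if node.depth() >= max_depth:          # stopping criterion: depth K
        return
    lookahead(node, propose, min(m, max_breadth))
    while node.children and len(node.children) < max_breadth:  # breadth B
        before = len(node.children)
        backtrack(node.children[-1], propose, min(m, max_breadth - before))
        if len(node.children) == before:   # nothing new was proposed
            break
    for child in list(node.children):
        explore(child, propose, m, max_depth, max_breadth)


def toy_llm(task: str, existing: List[str], m: int) -> List[str]:
    # Hypothetical stand-in for prompting an LLM for m new sub-tasks.
    start = len(existing)
    return [f"{task}/sub{start + i}" for i in range(m)]


def count_nodes(node: TaskNode) -> int:
    return 1 + sum(count_nodes(c) for c in node.children)


root = TaskNode("rewriting")
explore(root, toy_llm, m=2, max_depth=2, max_breadth=3)
print(count_nodes(root))  # 13 = 1 root + 3 children + 9 grandchildren
```

Passing the existing sub-task names to the proposer mirrors the idea that backtracking should find alternative branches rather than repeat known ones.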

(2) Instruction-Tuning data generation
For each task in the expanded domain-space tree, the LLM generates N instructions and their corresponding responses. To keep the instructions diverse, a diversity filter is applied, using the ROUGE-L overlap filter from SELF-INSTRUCT.

3. Data-Centric Analysis

To verify the usefulness of EXPLORE-INSTRUCT, the generated data is analyzed first. The baselines are (1) Domain-Specific Human-Curated : SUPER-NATURALINSTRUCTION, and (2) Domain-Aware Self-Instruct : EXPLORE-INSTRUCT with depth K=0.

3.1 Data Statistics

The table above gives statistics for Human-Curated, Domain-Aware Self-Instruct, and EXPLORE-INSTRUCT. EXPLORE-INSTRUCT has more verb-noun pairs and a smaller standard deviation. The effect is pronounced in the rewriting and math domains but less so in brainstorming.

For visualization, 10 root tasks are chosen and the generated instructions compared; EXPLORE-INSTRUCT (bottom) generates far more diverse instructions.

The figure above shows the ROUGE-L overlap, the measure used since SELF-INSTRUCT. Again EXPLORE-INSTRUCT, shown in pink, exhibits greater diversity.

4. Experiment

4.1. Benchmarks

The three domains generated with EXPLORE-INSTRUCT are used as testbeds.

rewriting : testbed built from the BELLE test set
brainstorming : testbed built from the BELLE test set
math : testbed built from the MATH test set

4.2. Explore-LM and Baseline Models

Explore-LM
The authors' model, fine-tuned on the instruction-tuning data generated by EXPLORE-INSTRUCT. Explore-LM-Ext is an extension fine-tuned on a larger set of sampled instances.

Baseline Models

Domain-Curated-LM : trained on the human-curated data mentioned above
Domain-Instruct-LM : trained on the domain-aware self-instruct data mentioned above
ChatGPT

All models above except ChatGPT are fine-tuned with LLaMA 7B as the backbone.

4.3. Results and Analysis

(1) Automatic evaluation results in the brainstorming and rewriting domains.

In the brainstorming and rewriting domains, Explore-LM clearly outperforms every model except ChatGPT.

(2) Automatic evaluation results in the math domains.

In math, it falls well short of ChatGPT but improves over the baseline models by a small margin.

(3) Human Evaluation.

It loses to ChatGPT but is preferred over the other baseline models.

(4) Data Structure Analysis

(5) Data Quantity Analysis

Conclusion

In this work, we introduce EXPLORE-INSTRUCT, a novel approach to enhancing domain-specific instruction coverage. Drawing inspiration from classical search algorithms, EXPLORE-INSTRUCT leverages the power of LLMs to actively explore the domain space and obtain diverse and domain-focused instruction-tuning data. Our experimental results demonstrate the efficacy of EXPLORE-INSTRUCT through data-centric analyses and model performance evaluations in the rewriting, brainstorming, and math domains, highlighting significant enhancements in instruction coverage and superior model performance compared to multiple baseline methods as demonstrated by both automatic and human evaluations.

[EMNLP2023] Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs

Mon, 29 Jan 2024 03:45:00 +0000

[pdf] [github]

Young-Suk Lee, Md Arafat Sultan, Yousef El-Kurdi, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos, Ramón Fernandez Astudill
IBM Research AI

Abstract

(Data Generation from ICL) As in Self-Instruct and Alpaca, generating data with ICL makes it possible to train models with little human supervision.
(Limitation) However, these methods resort to proprietary or non-public LLMs of around 175B parameters.
(Proposed Method) This paper implements the same technique with comparatively small models of 10-40B parameters that have permissive licenses. The authors show that the SELF-INSTRUCT approach works poorly at this size and propose a new ICL method.
((1) Categorization) The first idea is to categorize and simplify the ICL templates so that they are easier for the LM to learn.
((2) Ensembling) The second idea is to ensemble the outputs of several LMs to produce high-quality synthetic examples.
(Experiment) In the same setting as SELF-INSTRUCT, the method generates higher-quality instructions than SELF-INSTRUCT, and instruction tuning on them improves performance further.

1. Introduction

▶Existing instruction dataset generation via ICL relies on very large LLMs
Instruction-tuned LLMs can perform a very wide range of tasks. To enable this, automatic synthesis of large-scale instruction-tuning data is being actively studied. For example, SELF-INSTRUCT bootstraps a small set of expert seed examples through ICL (In-Context Learning) to generate an instruction-tuning dataset. The approach is powerful and has many follow-ups, such as Stanford Alpaca, which trains LLaMA on such data, but they still depend on a 175B-scale LLM.

▶ Ensemble-Instruct
This paper proposes Ensemble-Instruct, a method for generating high-quality instruction-tuning data with fully accessible smaller LMs of up to about 40B parameters. The authors first show that SELF-INSTRUCT performs poorly with models of this size, and then build Ensemble-Instruct on two main ideas: (1) categorizing and simplifying the ICL prompts and (2) ensembling over multiple LM outputs.

More specifically, whereas SELF-INSTRUCT generates instructions and then creates instances with input-first and output-first prompting, Ensemble-Instruct categorizes tasks into those without an input and those with an input and simplifies the prompts for each. It then collects outputs from a heterogeneous collection of LMs and ensembles them through majority voting.

▶ Experiment
The smaller models used for generation include UL2-20B, FALCON-40B, FLAN-T5-11B, FLAN-UL2-20B, and GPT-NeoXT-20B (chat-tuned). The base models that are later instruction-tuned are Pythia-1.4B, MPT-7B (a decoder-only LM similar to LLaMA), and GPT-JT-6B (an instructed version of GPT-J). All of these models are open source and carry permissive licenses (Apache-2).

As with SELF-INSTRUCT, the method is tested on SUPERNI, where it performs well, and the authors release the generated synthetic instruction-tuning dataset.

2. Ensemble-Instruct

The figure above gives an overview of Ensemble-Instruct. It consists of three main components: (1) categorization of tasks and their prompts, (2) generation of instructions followed by instances (where an instance comprises an input and an output), and (3) ensembling of outputs from multiple LMs.

2.1. Categorization of Tasks and Prompts

The authors split instructions into those that require an input (type A) and those that do not (type B). The figure below shows examples of each. Of the 175 seed tasks that SELF-INSTRUCT starts from, 125 are type A and 50 are type B.

2.2. Instruction Generation

For type A, 24 ICL exemplars (demonstrations) are used: 20 are drawn from the 125 type-A seed tasks and 4 are randomly sampled from previously generated instructions. For type B, 10 ICL exemplars are used: 8 from the 50 type-B seed tasks and 2 from previously generated instructions.
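The demonstration mix for one instruction-generation prompt can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_demonstrations`, the placeholder task strings, and the fallback of topping up from the seed pool when too few generated instructions exist are all assumptions of this sketch.

```python
import random

def sample_demonstrations(seed_tasks, generated, n_seed, n_generated, rng):
    """Pick ICL demonstrations: n_seed human seeds plus n_generated model outputs."""
    # Early in bootstrapping there may be too few generated instructions;
    # topping up from the seed pool is an assumption of this sketch.
    k_gen = min(n_generated, len(generated))
    k_seed = n_seed + (n_generated - k_gen)
    demos = rng.sample(seed_tasks, k_seed) + rng.sample(generated, k_gen)
    rng.shuffle(demos)  # avoid a fixed seed-then-generated ordering in the prompt
    return demos

rng = random.Random(0)
type_a_seeds = [f"seed-A-{i}" for i in range(125)]   # placeholder task names
generated_a = [f"gen-A-{i}" for i in range(30)]      # placeholder generated tasks
demos = sample_demonstrations(type_a_seeds, generated_a, n_seed=20, n_generated=4, rng=rng)
print(len(demos))
```

For type B the same helper would be called with the type-B seed list and `n_seed=8, n_generated=2`.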

Following SELF-INSTRUCT, only instructions whose ROUGE-L overlap with existing ones is below 0.7 are kept; the rest are filtered out.

2.3. Instance Generation

Instance generation uses 18 ICL exemplars for type A and 15 for type B. Figure 2 above shows examples of type A and type B.

2.4. Output Ensembling

Up to this point, apart from the categorization, the setup is not very different from SELF-INSTRUCT. Because smaller models are used, however, their outputs are much more likely to be inaccurate. The authors therefore ensemble the outputs of an additional set of LMs.

Ensembling follows the algorithm above. First, ROUGE-L similarity is computed for all three pairs of outputs. If every ROUGE-L score exceeds a threshold (that is, if the lowest score exceeds it), the first element of the pair with the highest ROUGE-L score is returned. The authors describe this as a greedy version of Minimum Bayes Risk (MBR) decoding.
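The consensus rule can be sketched in a few lines. This is an illustrative reading of the description above rather than the released implementation: ROUGE-L is computed here as an LCS-based F1 over lowercased whitespace tokens (no stemming), the threshold value is hypothetical, and returning `None` to drop an example that fails the check is an assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    """ROUGE-L F1 over lowercased whitespace tokens (a simplification)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def ensemble_output(outputs, threshold):
    """Return a consensus output from three candidates, or None to drop the example."""
    assert len(outputs) == 3
    pairs = [(0, 1), (0, 2), (1, 2)]
    scores = {p: rouge_l(outputs[p[0]], outputs[p[1]]) for p in pairs}
    if min(scores.values()) <= threshold:   # some pair disagrees too much
        return None
    best = max(pairs, key=lambda p: scores[p])
    return outputs[best[0]]                 # first element of the best-agreeing pair

candidates = ["the cat sat on the mat", "the cat sat on a mat", "a cat sat on the mat"]
print(ensemble_output(candidates, threshold=0.5))
```

Requiring the lowest pairwise score to clear the threshold means a single outlier output is enough to reject the example, which trades data volume for agreement among the LMs.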

3. Analysis of Instruction Tuning Dataset

Table 1 lists the labels (names) of the generated instruction-tuning datasets and the LMs used for each.

The table below summarizes basic information about the LMs used for generation.

3.1. Instance vs. Output Generation

As Table 1 shows, the LMs that generate instructions and instances differ from the LMs that generate the additional outputs. The first reason is that only large decoder-only models of around 20B parameters were able to generate input-output instances (type A). FALCON and GPT-NEOXT-CHAT were therefore used for instance generation. Table 3 below shows that FLAN-UL2, an instructed model, failed to generate instances at all.

Second, instruction-tuned models (FLAN-UL2, FLAN-T5-XXL, GPT-NEOXT-CHAT) produce high-quality zero-shot outputs, so they are used for additional output generation. The results below show that vanilla LMs such as UL2 and FALCON lag behind the instructed models.

3.2. Small LM Dataset Comparison

The authors instruction-tune Pythia-1.4B-deduped and evaluate it on the 119 SUPERNI test tasks; Table 4 above shows the results. Here M-SELF-INST denotes the SELF-INSTRUCT algorithm applied with {UL2, NEOX}, F-SELF-INST denotes the same with FALCON, and ALPACA and SELF-INST come from applying the SELF-INSTRUCT algorithm with much larger models (LLaMA and GPT-3).

SO denotes models without ensembling and EO models with ensembling. With both {UL2, NEOX} and FALCON, they beat the SELF-INSTRUCT algorithm by a wide margin. Notably, EO-ILM (ensembling with ICL applied) is much better than the SO- models without ensembling, whereas EO-LM, which ensembles without ICL, actually scores lower than SO- (32.9 vs 34.4).

3.3. Qualitative Analysis

4. Experimental Results

Information on the evaluation datasets is given below.

▶ Evaluation results on the SuperNI test set.

When applied to MPT-7B, training on as few as about 30K samples yields better performance than larger models trained on 80K samples.

▶ (SuperNI) Results of GPTJT-6B fine-tuned on synthetic data.

The trend is similar to the previous results.

▶ (user-oriented) Results on the 252 user-oriented test set.

▶ Experimental results with other much larger models to illustrate the scalability of the proposed Ensemble-Instruct to any black-box models.

Conclusion

We present a novel technique to generate instruction-tuning data through ICL, following the recent Self-Instruct work (Wang et al., 2023). Unlike Self-Instruct, we propose techniques that explicitly avoid the use of proprietary language models like GPT-3, ChatGPT or GPT-4. We show that when using smaller models, Self-Instruct becomes less performant. To overcome this, we draw on two main ideas: (a) Categorization and simplification of ICL templates to make prompt learning easier, and (b) Ensembling over multiple LM outputs to select high-quality examples. These ideas allow us to outperform training with Self-Instruct while utilizing the same seed tasks. The resulting synthetic data enables base models like MPT-7B to outperform GPT-3, a far larger model with 175B parameters. The results of this work also encourage the departure from closed-access models for advancing instruction generation algorithms.

Limitations

Due to time and resource constraints, some parts of the experimental setup are not ideal. All model outputs were collected from an internal API serving models from HuggingFace. Due to limitations of this API, a different number of samples was collected for each model, which may have introduced noise in the performance estimates. We report the exact number of samples used for training along with the results. Note that for cases using ensembling one has to take into account that there is an additional filtering process that removes samples.
We provide approximate rates for ensembling filtering in Table 3. For the small user-oriented test set containing 252 tasks, automatic evaluation is arguably not ideal. Proper human evaluation would provide a clearer signal but this requires significant time investment and resources. The method employs a set of various LMs, and therefore the generated synthetic data can be susceptible to the limitations of such LMs, particularly the biases inherent in the training data, which may be harmful, leading to synthetic data with hate, abuse and social stereotypes.

[ACL2023] SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions

Fri, 26 Jan 2024 08:00:00 +0000

[pdf] [github]

Yizhong Wang ^♣, Yeganeh Kordi ^♢, Swaroop Mishra ^♡, Alisa Liu ^♣ Noah A. Smith ^♣+, Daniel Khashabi ^♠, Hannaneh Hajishirzi ^♣+
^♣ University of Washington ^♢ Tehran Polytechnic ^♡ Arizona State University ^♠ Johns Hopkins University ⁺ Allen Institute for AI

Abstract

(Instruction Tuning) Instruction tuning, which finetunes a language model to respond to instructions, gives the model strong generalization to new tasks.
(Lack of dataset) However, human-written instruction data is limited in quantity, diversity, and creativity.
(SELF-INSTRUCT) The authors present a framework that improves an LLM's instruction-following ability by bootstrapping off its own generations.
(Pipeline) The pipeline generates instruction, input, and output samples, then filters out those that are invalid or similar to earlier ones.
(Experiment) Applied to vanilla GPT-3, the method reaches performance on par with InstructGPT (text-davinci-001), which was trained with private user data and human annotations.
(Broader Impact) SELF-INSTRUCT is an almost annotation-free method for aligning PLMs with instructions, and the authors release the resulting synthetic dataset for future instruction tuning.

Introduction

Lack of instruction tuning data
NLP has recently witnessed the power of LLMs. Two key components are behind this: (1) the LLMs themselves and (2) human-written instruction data (e.g. PROMPTSOURCE and SUPERNATURALINSTRUCTIONS, abbreviated SUPERNI). However, collecting instruction data is very costly, and because annotators gravitate toward popular tasks, the data has limited diversity. An alternative way to supply the instruction tuning process is therefore needed.

SELF-INSTRUCT
This paper proposes SELF-INSTRUCT, a semi-automated process that performs instruction tuning using instructional signals from the model itself. The figure above shows the overall process.

(1) Starting from a limited set of human-written seed tasks, the model is prompted to generate instructions for new tasks. The existing collection of instructions is used to elicit broad-coverage instructions for those new tasks.
(2) Based on the generated instructions, the model then generates inputs and outputs.
(3) Finally, various heuristics filter out low-quality and repeated instructions.
(4) The process repeats until the number of tasks reaches the desired size.

Experiments
The authors apply SELF-INSTRUCT to vanilla GPT-3, generating 52K instructions and 82K input-output pairs. As the figure above shows, the method produces a wide range of creative tasks that are distinct from typical NLP tasks. GPT-3 tuned with SELF-INSTRUCT performs comparably to InstructGPT001 not only on typical NLP tasks such as those in SUPERNI but also on new instruction tasks.

Method

Defining Instruction Data

An instruction $I_t$ defines the $t$-th task, which has input-output instances $(X_{t,i}, Y_{t,i})$. Given the instruction and an input, the model $M$ is expected to produce $M(I_t, X_{t,i})=Y_{t,i}$. The boundary between instruction and input is not strict. For example, “write an essay about school safety” can be an instruction on its own, or “write an essay about the following topic” can be the instruction with “school safety” given as the input. The input $X$ may therefore be empty.

Automatic Instruction Data Generation

As the first figure shows, the SELF-INSTRUCT pipeline has four steps.

1) generating task instructions
2) determining if the instruction represents a classification task
3) instance generation with either an input-first or output-first approach
4) filtering low-quality data

(1) Instruction Generation
The process starts from a small seed set: 175 tasks, each with one instruction and one instance. At each step, 8 task instructions are used as in-context examples to build the prompt, which is shown below.

(2) Classification Task Identification
Because whether a task is a classification task matters for the later steps, the second step decides whether each generated instruction represents a classification task. The model is queried with few-shot ICL using the prompt in the figure below.

(3) Instance Generation
This step generates instances for a given instruction. It is the most challenging step because the model must (1) understand the target task, (2) work out which input fields are needed and generate them, and (3) complete the instance by generating the output. In other words, the model has to do all three at once.

The authors first confirm that LMs have this ability when given in-context examples in instruction-input-output form. They call this the Input-first approach.

However, they then find that this approach produces outputs biased toward a single label, especially for classification tasks. They therefore propose the Output-first approach, which first generates the possible class labels and then generates the input conditioned on each label and the instruction. Part of the Output-first prompt is shown in the figure below.

The authors apply the Output-first approach to classification tasks and the Input-first approach to non-classification tasks.

(4) Filtering and Postprocessing
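The newly generated instruction-instance pairs are filtered before being added to the task pool. To avoid near-duplicate tasks, an instruction is discarded if its ROUGE-L similarity to an existing instruction is 0.7 or higher. Instances with exactly the same input and output are also removed. In addition, heuristics filter out invalid generations, for example instructions that are too short or too long.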
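A minimal sketch of these filters is shown below. It is not the authors' code: ROUGE-L is simplified to an LCS-based F1 over lowercased whitespace tokens, and the word-count limits `min_words` and `max_words` are hypothetical stand-ins for the "too short or too long" heuristic.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    """ROUGE-L F1 over lowercased whitespace tokens (a simplification)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def keep_instruction(new_instruction, pool, max_similarity=0.7,
                     min_words=3, max_words=150):
    """True if the instruction passes the length and novelty filters."""
    n_words = len(new_instruction.split())
    if not (min_words <= n_words <= max_words):   # hypothetical length heuristic
        return False
    return all(rouge_l(new_instruction, old) < max_similarity for old in pool)

def dedupe_instances(instances):
    """Drop instances whose (input, output) pair was already seen, keeping order."""
    seen, unique = set(), []
    for pair in instances:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique

pool = ["Write a short poem about autumn leaves."]
print(keep_instruction("Translate the following sentence into French.", pool))
```

Checking each candidate against the whole pool is quadratic in the pool size, which is acceptable at the scale of tens of thousands of instructions but would need indexing for much larger pools.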

Finetuning the LM to Follow Instructions

The model is then trained on this diverse, large-scale instruction data. Prompts in a variety of templates are used to make the model robust across tasks.

SELF-INSTRUCT Data from GPT-3

This section reports statistics of the dataset generated by running SELF-INSTRUCT with GPT-3.

Statistics

About 52K instructions and 82K instances were generated.

Diversity

Using the Berkeley Neural Parser to extract verb-noun structures, the analysis shows that a diverse range of instructions was generated, as in the figure below.

To check how much the generated instructions overlap with the 175 seed tasks, the ROUGE-L overlap below shows that many of the generated tasks are novel and have little overlap with the seeds.

Finally, the figure below analyzes the lengths of the instructions, non-empty inputs, and outputs.

Quality

To assess quality, 200 instructions (one instance each) were sampled and rated by an annotator, leading to the finding that “most of the generated instructions are meaningful”.

Experimental Results

The model trained on SELF-INSTRUCT data is called $GPT-3_{SELF-INSTRUCT}$ and is compared against the baselines below.

Off-the-shelf LMs: T5-LM, GPT-3
Publicly available instruction-tuned models: T0, T$k$-INSTRUCT
Instruction-tuned GPT3 models: INSTRUCTGPT (text-davinci-001)

Experiment 1 : Zero-Shot Generalization on SUPERNI Benchmark

These are the results on SUPERNI, an instruction-following benchmark. The experiments are mostly run in a zero-shot setting.

SELF-INSTRUCT substantially boosts the instruction-following ability of GPT-3.
The resulting model performs nearly on par with InstructGPT001.

Experiment 2 : Generalization to User-oriented Instructions on Novel Tasks

To validate practical usage, the authors curate a user-oriented instruction set as a benchmark. They first choose domains where LLMs are likely to be used often, such as email writing, social media, entertainment, and programming, and then craft instruction-instance pairs. This yields 252 instructions with one instance each. A small portion is shown in the figure below.

The results are as follows.

Responses were rated on a four-level scale of A, B, C, and D (A being the best), and the model's results were close to those of text-davinci-001.

Effect of Data Size and Quality

The blue line in the figure above shows performance increasing consistently as the data size grows. The orange point shows that having text-davinci-003 generate the outputs, which gives higher-quality data, yields much better performance. Since better data quality translates into better performance, the authors see ample room for improvement.

Conclusion

We introduce SELF-INSTRUCT, a method to improve the instruction-following ability of LMs via their own generation of instruction data. On experimenting with vanilla GPT3, we automatically construct a large-scale dataset of 52K instructions for diverse tasks, and finetuning GPT3 on this data leads to a 33% absolute improvement on SUPERNI over the original GPT3. Furthermore, we curate a set of expert-written instructions for novel tasks. Human evaluation on this set shows that tuning GPT3 with SELF-INSTRUCT outperforms using existing public instruction datasets by a large margin and performs closely to InstructGPT001. We hope SELF-INSTRUCT can serve as the first step to align pretrained LMs to follow human instructions, and future work can build on top of this data to improve instruction-following models.

[EMNLP2023] Retrieval-Generation Alignment for End-to-End Task-Oriented Dialogue System

Wed, 24 Jan 2024 08:00:00 +0000

[pdf] [github]

Weizhou Shen¹, Yingqi Gao¹, Canbin Huang¹, Fanqi Wan¹, Xiaojun Quan^1*, Wei Bi^2*
¹ School of Computer Science and Engineering, Sun Yat-sen University, China ² Tencent AI Lab

Abstract

(Retrieval for TOD) To handle localized and specialized tasks effectively, task-oriented dialogue (TOD) systems retrieve information from a knowledge base (KB).
(Retrieval Error) However, generative models such as ChatGPT and T5 produce incorrect results when processing retrieved KB records because the records differ only in subtle ways.
(Proposed Method) This paper uses maximal marginal likelihood to train a perceptive retriever with signals from response generation.
(Experiment) In experiments with T5 and ChatGPT as backbones, the authors verify that the trained retriever supplies high-quality knowledge records from which good responses are generated.

1. Introduction

TOD : Task-Oriented Dialogue System
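A task-oriented dialogue (TOD) system helps users accomplish specific goals such as booking a train or arranging a schedule. TOD systems are usually either pipeline or end-to-end: in the former, modules such as DST and Policy run one after another, while the latter generates the response in one pass without intermediate steps. Research long focused on pipeline models and their modules, but with the arrival of large language models (LLMs), interest in end-to-end models has surged.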

RAG : Retrieval-augmented generation
Retrieval-augmented generation (RAG) retrieves external knowledge and uses it to generate the result. Q-TOD applied RAG to end-to-end TOD (E2E-TOD) and far surpassed earlier methods. According to the authors' preliminary study, however, the correlation between the performance of the knowledge retriever and that of the response generator is quite weak, which means that improving the retriever does not necessarily improve overall generation. The authors call this phenomenon the misalignment between retrieval and generation and identify it as a bottleneck holding back recent E2E-TOD systems.

Qualitative analysis
From qualitative analysis, the authors hypothesize that the misalignment stems from the homogeneity of the retrieved knowledge entities. As Figure 1 above shows, the retrieved entities are highly similar, differing only slightly. As a result, the response generator responds predominantly to learned language tokens rather than to knowledge-related tokens.

MK-TOD : Meta Knowledge for TOD
The authors therefore propose Meta Knowledge for end-to-end Task-Oriented Dialogue system (MK-TOD), which aims to resolve this misalignment. First, maximal marginal likelihood lets the retriever be trained progressively throughout training. Second, so that the response generator can tell entities apart, it is given the ability to use meta knowledge. Meta knowledge here is additional information about retrieval and consists of retrieval order, retrieval confidence, and co-occurrence rate. Three ways of incorporating meta knowledge are explored: (1) adding special prefix tokens, (2) using prompts, and (3) applying contrastive learning. In addition, the authors experiment with negative knowledge to give the generator discriminative ability.

Experiments
MK-TOD is applied with T5 and ChatGPT as backbones and compared with other E2E-TOD systems. On benchmark datasets such as SMD, CamRest, and MultiWOZ, the proposed system outperforms the previous state of the art. The authors also show that MK-TOD effectively improves ChatGPT's in-context learning. Further analysis shows that learning meta knowledge helps substantially in reducing the misalignment.

2. Methodology : MK-TOD

Figure 2 gives an overview of MK-TOD. For each turn, the retriever extracts relevant entities, which are combined with meta knowledge and passed to the response generator. The response generator uses this information to generate the response, and training raises both the normal text generation likelihood and the marginal likelihood.

2.1. Notations

Dialog $D = (u_1,r_1, …, u_T,r_T)$ with $T$ turns where $u_t$ and $r_t$ are the $t$-th turn user and system utterances.
Context $c_t = (u_1,r_1, …, u_t)$
External KB $K = (e_1, e_2, …, e_B)$ with knowledge base size $B$
To generate $r_t$, the system takes $c_t$ and $K$ as input.

2.2. System Overview

The context encoder in the retriever module embeds the context, and the resulting vector is scored against the KB entity vectors produced by the entity encoder to select the Top-K entities. These Top-K entities are then concatenated with meta knowledge and passed to the response generator.
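The Top-K selection can be illustrated with a small dual-encoder scoring sketch. The text does not state the scoring function, so the dot product used here is an assumption, as are the function name `top_k_entities` and the toy vectors; the softmax over the retrieved subset gives the retrieval probabilities used later for marginalization.

```python
import math

def top_k_entities(context_vec, entity_vecs, k):
    """Rank KB entities by dot-product score; return (indices, retrieval probs)."""
    scores = [sum(c * e for c, e in zip(context_vec, vec)) for vec in entity_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the retrieved Top-K entities.
    m = max(scores[i] for i in ranked)
    exps = [math.exp(scores[i] - m) for i in ranked]
    total = sum(exps)
    return ranked, [x / total for x in exps]

context = [1.0, 0.0]                                # toy context embedding
entities = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]     # toy entity embeddings
indices, probs = top_k_entities(context, entities, k=2)
print(indices, probs)
```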

Using the equation above, the response generator produces the probability of the response given the retrieved entities $\epsilon_t$. Text generation is then trained with the negative log-likelihood (NLL).

2.3. Maximum Marginal Likelihood

Because there are no retrieval labels during training, the retriever is trained with supervision signals from the generator. However, the NLL loss above cannot be backpropagated to the retriever, so Maximum Marginal Likelihood (MML) is used instead.

MML integrates the likelihood over all entities in the knowledge base, giving a Bayesian perspective on the knowledge as a whole. $\pi$ denotes the retriever parameters and $q$ the retrieval probability of a single entity. Because computing $q$ for every entity is costly, the full KB is replaced with the retrieved Top-K entities $\epsilon$, as in EMDR. The formula therefore changes to use $\epsilon$ in place of $K$, as below.

The MML loss is then given below,

and finally it is combined with the text generation loss through the hyperparameters $\alpha$ and $\beta$ to give the final loss below.
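As a rough numerical illustration of the marginalization over the Top-K entities described in this section, the sketch below computes $-\log \sum_{e \in \epsilon} q(e \mid c)\, p(r \mid c, e)$ from retrieval scores and per-entity response log-likelihoods. The exact form of the paper's equations is not reproduced in this post, so the softmax normalization of $q$ and the weighted sum $\alpha L_{NLL} + \beta L_{MML}$ are assumptions of this sketch, and all function names are hypothetical.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def mml_loss(retrieval_scores, response_logprobs):
    """-log sum_e q(e | c) * p(r | c, e) over the Top-K retrieved entities.

    retrieval_scores: unnormalized retriever scores, one per entity.
    response_logprobs: log p(r | c, e) from the generator, one per entity.
    """
    log_z = logsumexp(retrieval_scores)
    log_q = [s - log_z for s in retrieval_scores]   # log-softmax of the scores
    return -logsumexp([lq + lp for lq, lp in zip(log_q, response_logprobs)])

def total_loss(nll, mml, alpha, beta):
    """Assumed combination of the two losses as a weighted sum."""
    return alpha * nll + beta * mml

# The retriever is rewarded for putting mass on the entity that explains the response.
helpful = mml_loss([2.0, 0.0], [math.log(0.9), math.log(0.1)])
unhelpful = mml_loss([0.0, 2.0], [math.log(0.9), math.log(0.1)])
print(helpful < unhelpful)
```

Because the retrieval probabilities appear inside the loss, gradients reach the retriever even though no retrieval labels exist, which is the point of using MML here.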

2.4. Meta Knowledge

Meta knowledge is introduced to guide the generator toward the right entity and thereby resolve the misalignment. It consists of three parts: (1) retrieval order, (2) retrieval confidence, and (3) co-occurrence. Retrieval confidence is bucketed into three levels (high, middle, and low confidence) by hyperparameter thresholds, and co-occurrence records whether an entity has already appeared in the preceding context.

As mentioned above, three approaches are designed for applying this meta knowledge.

(1) Prefix
This is a mapping function onto special tokens. For example, an entity that is ranked second, has middle confidence, and has not yet been mentioned in the context is mapped to <2nd-entity> followed by the corresponding confidence and co-occurrence tokens.

(2) Prompt
A prompt such as “This is the top-1 recalled entity with low confidence” is added.
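The prompt variant can be sketched as a small formatting helper. Only the quoted sentence pattern comes from the text; the confidence thresholds (`low`, `high`), the wording of the co-occurrence clause, and the function names are hypothetical choices for illustration.

```python
def confidence_bucket(prob, low=0.3, high=0.7):
    """Map a retrieval probability to one of three levels (thresholds are hypothetical)."""
    if prob >= high:
        return "high"
    if prob >= low:
        return "middle"
    return "low"

def meta_knowledge_prompt(rank, prob, mentioned):
    """Describe an entity's retrieval order, confidence, and co-occurrence in words."""
    bucket = confidence_bucket(prob)
    # The co-occurrence wording below is an assumption of this sketch.
    seen = "already mentioned" if mentioned else "not yet mentioned"
    return (f"This is the top-{rank} recalled entity with {bucket} confidence, "
            f"{seen} in the context.")

print(meta_knowledge_prompt(rank=1, prob=0.1, mentioned=False))
```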

(3) Contrastive Learning
The generator can be trained in a contrastive learning setup using the entities $\epsilon$ returned by the retriever. ※ See the paper for details on the contrastive learning.

2.5. Negative Entity

As with negative sampling in information retrieval, the entity with the lowest score in $K$ is taken as a negative entity and used in training together with special negative meta knowledge.

3. Experimental Settings

3.1. Datasets and Metrics

Datasets

MultiWOZ 2.1 (MWOZ)
CamRest
Stanford Multi-Domain (SMD)

The KB can be session-level or shared across the whole dataset.

Metrics

BLEU
Entity F1
Recall@K

3.2. Implementation Details

Retriever : BERT
Generator : T5, ChatGPT (in-context learning) ※ ChatGPT: inference only, using the retriever trained with T5

3.3. Baseline Methods

Implicit Retrieval

DF-NET
EER
FG2Seq
CDNET
GPT-kE

Explicit Retrieval

Q-TOD
DialoKG
MAKER

LLM

ChatGPT (gpt-3.5-turbo)

4. Results and Analysis

4.1. Overall Results with Large-Scale KBs

With MK-TOD, generators of the same scale all outperform their counterparts without it. Q-TOD is strong on Recall@K, indicating a good retriever, but the proposed method leads on BLEU and Entity F1, which confirms that MK-TOD makes better use of the retrieved knowledge. On CamRest, T5-Base outperforms T5-Large, which the authors attribute to the small training set of CamRest.

The ChatGPT results at the bottom support the reading that relying solely on in-context learning does not enable ChatGPT to perform as well as the fine-tunable methods in the context of E2E-TOD.

4.2. Overall Results with Condensed KBs

The table above shows the results with condensed KBs. The trend is similar to the previous results.

4.3. Retrieval-Generation Misalignment

The misalignment is analyzed with the baselines from Section 4.1. The solid lines show that better retriever performance does not translate into better generator performance. In some cases, using oracle entities even gives worse results than a weak retriever. The dashed lines show that this misalignment does not appear when MK-TOD is applied.

4.4 Ablation Study

Maximum Marginal Likelihood

MML is an important component in all three approaches.

Types of Meta Knowledge

Using all three types of meta knowledge does not differ much from using a single type.

Negative Samples

Using negative samples during training improves T5-Base. When applied to ChatGPT, however, negative samples do not contribute.

4.5. Behavior of Generator

A generator trained with MK-TOD tends to make the most of entities with high retrieval order and confidence. This suggests that meta knowledge improves performance through an inductive bias that prioritizes entities.

5. Conclusion

This paper aims to address the retrieval-generation misalignment in end-to-end task-oriented dialogue systems by introducing maximal marginal likelihood to train a perceptive retriever that leverages signals from response generation. To enable the response generator to better distinguish between entities, we explore several methods for incorporating retrieval-related meta knowledge. We also propose to incorporate negative entities to enhance the discriminative capability. Experimental results demonstrate that when combined with meta knowledge, the response generator effectively leverages high-quality retrieval knowledge, leading to enhanced quality in the generated responses. Through analysis, we observe that previous retrieval-augmented generator models suffer from severe retrieval-generation misalignment, while our method mitigates this misalignment.

[EMNLP2023] Active Retrieval Augmented Generation

Fri, 12 Jan 2024 08:00:00 +0000

[pdf] [github]

Zhengbao Jiang ^1*, Frank F. Xu ^1*, Luyu Gao ^1*, Zhiqing Sun ^1*, Qian Liu ², Jane Dwivedi-Yu ³, Yiming Yang ¹, Jamie Callan¹, Graham Neubig¹
¹ Language Technologies Institute, Carnegie Mellon University ² Sea AI Lab ³ Meta AI Research

Abstract

(Hallucination) Recent LLMs show remarkable abilities but tend to hallucinate, producing inaccurate output.
(One-step retrieval and Weakness) Retrieval-augmented LMs have been studied to address this, but most retrieve information only once in a retrieve-and-generate setup. This is a weakness for long text generation, where information has to be gathered continually.
(Multi-step retrieval and Weakness) Methods that retrieve multiple times while generating have also been proposed, but they retrieve documents at fixed intervals.
(Active RAG) The authors propose active retrieval augmented generation, which actively decides when and what to retrieve.
(FLARE) On this basis they propose Forward-Looking Active REtrieval (FLARE), a retrieval-augmented generation method that, whenever low-confidence tokens appear, retrieves the information that will be needed for upcoming text.
(Experiment) FLARE achieves superior or competitive performance on 4 long-form knowledge-intensive generation datasets.

Introduction

Generative LMs (GPT-3, InstructGPT, GPT-4, PaLM, RAG, LLaMA, and others) are a fundamental component of recent NLP systems and show remarkable ability in understanding and generating language. Although LMs learn a vast amount of world knowledge during training, they still hallucinate, generating imaginary content ([1], [2], [3]). Retrieval has been proposed as a way to overcome hallucination: many methods supply external knowledge by augmenting a parametric LM with a non-parametric retrieval component (RAG, FiD, kNN-LM, Atlas, ReAtt, REPLUG, and others).

Such retrieval-augmented LMs typically use a retrieve-and-generate setup: they retrieve documents based on the user's input and then generate a complete answer. These single-time retrieval-augmented LMs substantially outperform parameter-only LMs (no retrieval), but they work well mainly in short-form knowledge-intensive settings such as factoid QA and fact checking. In short-form generation, the information relevant to the user's input is clear, and retrieving relevant knowledge once based on the input is enough.

LLMs have recently also shown strong performance in generating long-form output, as in long-form QA (ELI5, ASQA), open-domain summarization, and chain-of-thought (CoT) reasoning. In long-form generation, the complex information needed for the answer is not always evident from the input alone. Just as humans do when writing a paper, essay, or book, an LM would need to gather multiple pieces of knowledge throughout the generation process. For example, in open-domain summarization ([4]), the initial retrieval is based on the topic name (e.g. Joe Biden), which cannot cover every aspect and detail. Extra information (e.g. the education history of Joe Biden) therefore has to be retrieved during generation.

Several studies have tried to build systems that retrieve multiple times. These attempts passively use the past context to retrieve additional information at fixed intervals (kNN-LM (ICLR 2020), RETRO (ICML 2022), RALM, IRCoT (ACL 2023)). As a result, they may not accurately reflect what the LM intends to generate in the future, or they may retrieve at inappropriate points.
Some works decompose the full question for multi-hop QA (Self-Ask, ReAct, DecomP, DSP).

The authors address the following question: can we create a simple and generic retrieval-augmented LM that actively decides when and what to retrieve throughout the generation process? They argue that deciding when to retrieve avoids unnecessary or inappropriate retrieval. Building on findings that LLMs tend to be well calibrated and that low probability or confidence often signals a lack of knowledge ([6], [7]), they adopt a strategy of retrieving when the LM generates low-probability tokens.

When deciding what to retrieve, it is important to consider what the LM intends to generate next, since the goal of active retrieval is to benefit future generation. The authors therefore generate a temporary next sentence, use it as a query to retrieve relevant documents, and then regenerate the sentence conditioned on the retrieved documents. Combining these two aspects (when and what to retrieve), they propose Forward-Looking Active REtrieval augmented generation (FLARE). FLARE iteratively generates a temporary next sentence, uses it as the query to retrieve relevant documents if it contains low-probability tokens, and regenerates the next sentence, until it reaches the end.

FLARE is applicable to any LM. Applied with GPT-3.5 (text-davinci-003) to a variety of tasks, it performs very well: multihop QA (2WikiMultihopQA), commonsense reasoning (StrategyQA), long-form QA (ASQA), and open-domain summarization (WikiAsp).

Retrieval-Augmented Generation

Notations and Definitions

Given a user input $x$ and a document corpus $D$, the goal of a retrieval-augmented LM is to generate $y=[s_1, s_2, …, s_m] = [w_1, w_2, …, w_n]$, consisting of $m$ sentences or $n$ tokens. Because retrieval is used, $y=LM([D_q, x])$, where $D_q = ret(q)$ for a query $q$.

Single-time Retrieval-Augmented Generation

A single-time retrieval-augmented LM uses the user input $x$ directly as the query $q$ and retrieves only once, giving $y=LM([D_q, x])$ with $q = x$.

Active Retrieval Augmented Generation

Active RAG is formulated as follows. At step $t$, the retrieval query $q_t$ depends on the input $x$ and the output generated so far, $y_{<t} = [y_0, …, y_{t-1}]$. The query is therefore $q_t = qry(x,y_{<t})$, where $qry$ is the query formulation function. At the start, the query is the input itself ($q_1 = x$). The output at each step is then $y_t = LM([D_{q_t}, x , y_{<t}])$.
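The formulation above maps onto a short generation loop. The sketch below is only a schematic with toy stand-ins: `toy_lm_step`, `toy_retrieve`, and `last_sentence_query` are hypothetical stubs for $LM$, $ret$, and $qry$, and signalling the end of generation by returning `None` is an assumption of this sketch.

```python
def active_rag(x, lm_step, retrieve, qry, max_steps=8):
    """Generate y step by step, re-querying the retriever before every step."""
    y = []
    for _ in range(max_steps):
        q_t = x if not y else qry(x, y)   # the first query is the input itself
        docs = retrieve(q_t)              # D_{q_t} = ret(q_t)
        y_t = lm_step(docs, x, y)         # y_t = LM([D_{q_t}, x, y_{<t}])
        if y_t is None:                   # assumed end-of-generation signal
            break
        y.append(y_t)
    return y

# Toy stand-ins so the loop can be run end to end.
queries = []
script = ["Sentence one.", "Sentence two."]

def toy_retrieve(q):
    queries.append(q)
    return [f"doc for: {q}"]

def toy_lm_step(docs, x, y):
    return script[len(y)] if len(y) < len(script) else None

def last_sentence_query(x, y):
    return y[-1]

result = active_rag("Who is Joe Biden?", toy_lm_step, toy_retrieve, last_sentence_query)
print(result)
```

Different choices of `qry` and of when to call `retrieve` recover the single-time, fixed-interval, and forward-looking variants discussed in this post.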

FLARE: Forward-Looking Active REtrieval Augmented Generation

The authors make two hypotheses: (1) the LM should retrieve only when it lacks the necessary knowledge, and (2) queries should reflect the intent of future generation. FLARE is designed with these in mind. There are two variants: $FLARE_{instruct}$, inspired by Toolformer, which prompts the LM with instructions to generate retrieval queries, and $FLARE_{direct}$, which uses the LM's own generation directly as the search query.

FLARE with Retrieval Instructions

The first approach retrieves the needed information by having the LM emit “[Search(query)]”, as in Toolformer (e.g. “The colors on the flag of Ghana have the following meanings. Red is for [Search(Ghana flag red meaning)] the blood of martyrs, …”). This behavior is elicited from GPT-3.5 through few-shot prompting.

Two skills are needed for this behavior: one instruction prompt teaches the LM to generate search queries, and another instructs it to generate answers that solve the downstream task. The instruction prompts are organized as in the figure below.

As in the figure below, when the LM generates “[Search(query)]”, generation stops and the query terms are used to retrieve relevant documents. The retrieved documents are prepended before the user input, which aids future generation.

The authors observe that the LM can combine these two skills effectively, generating meaningful search queries while carrying out the task. There are two issues, however: (1) the LM tends to generate fewer search queries than necessary, and (2) generating excessive search queries can disrupt answer generation and hurt performance.

The authors address each issue in turn. First, they raise the logit of the “[” token by 2.0 so that “[Search(query)]” is triggered as often as possible. Second, once a search has been issued through “[Search(query)]”, they give “[” a large negative logit for the next few tokens so that another “[Search(query)]” does not appear immediately.

Direct FLARE

Because $FLARE_{instruct}$ relies entirely on the LM generating retrieval queries, the queries may be unreliable when a black-box LM cannot be fine-tuned. The authors therefore also propose a more direct retrieval method.

Confidence-based Active Retrieval : As in Figure 1, at step $t$ a temporary next sentence $\hat{s_t} = LM([x,y_{<t}])$ is first generated without retrieval. $\hat{s_t}$ is then used to decide whether to trigger retrieval. If the LM is confident enough about $\hat{s_t}$ without additional retrieved information, the sentence is accepted as is. Otherwise, $\hat{s_t}$ is used to retrieve documents and the sentence $s_t$ is regenerated. The decision is governed by a threshold $\theta$. In summary, the actual output sentence $y_t$ is generated as follows.
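One step of this decision can be sketched as below. The rule used here, accepting $\hat{s_t}$ only if every token probability is at least $\theta$, is a reading of the description above; `toy_generate` and `toy_retrieve` are hypothetical stubs that stand in for the LM and the retriever, with made-up token probabilities.

```python
def flare_step(x, y_prev, generate, retrieve, theta):
    """Return (sentence, retrieved_flag) for one FLARE-direct step."""
    tentative, token_probs = generate([], x, y_prev)   # draft without retrieval
    if all(p >= theta for p in token_probs):
        return tentative, False                        # confident: keep the draft
    docs = retrieve(tentative)                         # use the draft as the query
    regenerated, _ = generate(docs, x, y_prev)         # regenerate with documents
    return regenerated, True

# Toy stand-ins with made-up probabilities for illustration only.
def toy_generate(docs, x, y_prev):
    if docs:
        return "Joe Biden attended the University of Delaware.", [0.9] * 7
    return ("Joe Biden attended the University of Pennsylvania.",
            [0.9, 0.9, 0.8, 0.9, 0.9, 0.9, 0.2])

def toy_retrieve(query):
    return ["Joe Biden graduated from the University of Delaware in 1965."]

sentence, retrieved = flare_step("Summarize Joe Biden.", [], toy_generate, toy_retrieve, theta=0.5)
print(sentence, retrieved)
```

Setting `theta` to 0 disables retrieval entirely, while setting it to 1 retrieves at essentially every sentence, so the threshold interpolates between a no-retrieval LM and a retrieve-every-sentence baseline.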

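Confidence-based Query Formulation : One way to perform retrieval is to use the next sentence $\hat{s_t}$ directly as the query $q_t$. This is similar in spirit to existing methods that use generated hypothetical titles or paragraphs ([8], [9]), that is, methods that use the LM's generation rather than the original input question as the retrieval query ([8], [10]). The authors extend this technique to long-form generation.

Empirically, retrieving with the next sentence gives significantly better results than retrieving with the previous context (detailed in Section 6.2). However, it risks propagating errors contained in that sentence. For example, if the LM generates the inaccurate statement “Joe Biden attended the University of Pennsylvania” instead of the correct fact that he attended the University of Delaware, using this erroneous sentence as the query may lead the retriever to fetch irrelevant information, which can mislead future generation. Figure 3 illustrates two simple methods for overcoming this problem.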

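Masked sentences as implicit queries : The first method masks out the low-confidence tokens in $\hat{s_t}$, that is, tokens whose probability is below a threshold β ∈ [0, 1]. A higher β means more aggressive masking. This removes potential distractions from the sentence and improves retrieval accuracy.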
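The masking step amounts to dropping low-probability tokens before the sentence is sent to the retriever. In the sketch below, tokens are whitespace-separated words with made-up probabilities; a real system would work with the LM's own tokens and their reported probabilities, and the function name `masked_query` is hypothetical.

```python
def masked_query(tokens, probs, beta):
    """Build an implicit query by removing tokens whose probability is below beta."""
    return " ".join(tok for tok, p in zip(tokens, probs) if p >= beta)

tokens = "Joe Biden attended the University of Pennsylvania".split()
probs = [0.95, 0.97, 0.80, 0.90, 0.70, 0.90, 0.20]   # illustrative values only
print(masked_query(tokens, probs, beta=0.4))
```

Here the uncertain word "Pennsylvania" is removed, so the retriever is no longer steered toward the possibly wrong university.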

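Generated questions as explicit queries : The other method generates explicit questions targeting the low-confidence spans in $\hat{s_t}$. For example, if the LM is uncertain about “the University of Pennsylvania”, a question such as “Which university did Joe Biden attend?” can help retrieve relevant information. Self-ask (Press et al., 2022) does this by manually inserting follow-up questions into the downstream task exemplars, as shown later in Prompt 4.1, which requires additional task-specific annotation. The authors instead develop a general method that generates questions for low-confidence spans without additional annotation. Specifically, they extract all spans in $\hat{s_t}$ with probability below β and, for each extracted span $z$, prompt GPT-3.5-turbo to generate a question $q_{t,z}$ that can be answered with that span. The prompt is shown below.

The generated questions are then used for retrieval, and the returned documents are used to generate the answer. In summary, the query $q_t$ for $\hat{s_t}$ is as follows.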

Implementation Details

Method 검증을 위해, GPT-3.5 LM 인 text-davinci-003 을 이용하여 API 를 반복적으로 query 하여 확인한다.

Inital qeury 시작 query 는 FLARE 가 user input $x$ 를 통해 문서를 검색하고, 첫 번째 문장인 $\hat{s_1} = LM([D_x, x])$ 를 생성하여 반복적인 생성프로세스를 시작한다.

Sentence tokenization

각 step $t$ 마다 대부분의 문장보다 긴 64개의 토큰을 생성하고, NLTK 문장 토크나이저를 사용하여 첫 번째 문장을 추출하고 나머지는 삭제한다.

Document corpus and retrievers: Because this work focuses on integrating retrieval and generation, it uses off-the-shelf retrievers that take a query as input and return a list of relevant documents. For datasets that mainly rely on knowledge from Wikipedia, the Wikipedia dump from Karpukhin et al. (2020) serves as the document corpus, with documents split into 100-token chunks, and BM25 (Robertson and Zaragoza, 2009) is the retriever. For datasets that rely on knowledge from the open web, the Bing search engine is the retriever.
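As a rough illustration of the 100-token chunking, here is a sketch that splits a document into fixed-size chunks. It assumes whitespace tokens as a stand-in for whatever tokenizer the original corpus preparation used, and the function name is hypothetical.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 100) -> List[str]:
    """Split a document into consecutive chunks of at most chunk_size tokens."""
    # Whitespace splitting is an assumption; a real pipeline may tokenize differently.
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]
```

Each chunk would then be indexed as a separate retrieval unit by the retriever.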

Retrieved document formatting: Multiple retrieved documents are linearized according to their rank and prepended to the user input in the following format:

Multi-time Retrieval Baselines

Existing passive multi-time retrieval-augmented LMs can also be formulated within the FLARE framework. This work introduces three baseline categories. Because prior works make many different design choices, direct comparison is impossible, so these baselines are not official reproductions. The authors implement them with the same settings, leaving out irrelevant designs, so that the only difference is when and what to retrieve.

Previous-window triggers retrieval every $l$ tokens and uses the preceding $l$ tokens as the query. RETRO, IC-RALM, and KNN-LM belong here (KNN-LM retrieves at every token).

Previous-sentence triggers retrieval at every sentence, using the previous sentence as the query. IRCoT belongs here.

Question decomposition has the LM decompose the question into sub-questions, so that retrieval is performed with multiple queries. Self-ask belongs to this category, and it is done with the following prompt:

All three methods above can retrieve additional information during generation. However, they have notable drawbacks: (1) Fixed-interval approaches use previously generated tokens as queries, which may not reflect what the LM intends to generate in the future. (2) Retrieving at fixed intervals can be inefficient because retrieval may occur at inappropriate points. (3) Question decomposition requires task-specific prompt engineering, which limits generalization to new tasks.

Experimental setup

To validate FLARE, the authors apply it to 4 tasks using few-shot in-context learning (ICL). For a fair comparison, FLARE and the baselines are compared under the same setting, i.e., the same in-context exemplars, prompt format, retriever, and document corpus. Because of cost, they follow IRCoT and sub-sample at most 500 examples from each dataset. FLARE's hyperparameters are chosen on the dev set and are listed in the table below. Unless otherwise specified, FLARE refers to $FLARE_{direct}$. For the previous-window approach, they follow Ram et al. (2023) and use a window size of $l=16$.

[Dataset descriptions omitted]

Experimental Results

Comparison with Baselines

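Across tasks and datasets, the improvement is most notable on multihop QA. This is largely because the task is clearly defined, with the specific goal of producing the final answer through a 2-hop reasoning process, which makes it easier for the LM to generate on-topic output. In contrast, ASQA and WikiAsp are less clearly defined and more open-ended, which makes both generation and evaluation harder. The improvement on ASQA-hint is larger than on ASQA: identifying ambiguous aspects is often hard even for humans, and providing a generic hint helps the LM stay on topic.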

Thorough comparisons with baselines

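All baseline results on 2WikiMultihopQA are in Table 1. FLARE outperforms all baselines by a large margin, confirming that forward-looking active retrieval is highly effective. Most multi-time retrieval-augmented approaches outperform single-time retrieval, but with different margins. The improvement from retrieving with the previous sentence is relatively small, which the authors conjecture is because the previous sentence in 2WikiMultihopQA often describes entities or relations different from those in the next sentence. In contrast, the previous-window approach may use the first half of a sentence as the query to retrieve information that helps generate the second half. Among all baselines, the question decomposition approach (Self-ask) performs best. This is because the in-context exemplars are manually annotated with decomposed sub-questions (Prompt 4.1), which guides the LM to generate sub-questions aligned with the topic/intent of future generation. FLARE outperforms this baseline, indicating that manual exemplar annotation is not necessary for effective forward-looking retrieval. The gap between $FLARE_{instruct}$ and question decomposition is large, indicating that teaching the LM to generate search queries with task-generic retrieval instructions and exemplars is challenging.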

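All metrics on the other datasets are in Table 2. Again, FLARE outperforms the baselines on all metrics. Retrieval with the previous window underperforms single-time retrieval on ASQA, which the authors hypothesize is because the previous window does not accurately reflect the user's future intent. Since the focus is on evaluating the factuality of generation, they consider metrics that emphasize factual content, such as EM, Disambig-F1, and UniEval, more reliable than metrics computed over all tokens (e.g., ROUGE-L).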

Ablation study

Importance of forward-looking retrieval

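The authors first check whether forward-looking retrieval is indeed more powerful than past-context-based retrieval. They run an ablation on 2WikiMultihopQA and ASQA-hint that compares retrieval with the previous sentence against retrieval with the next sentence. The two methods are identical except for the query used for retrieval: both retrieve at every sentence and use the complete sentence directly as the query (no masking or question generation). As Table 3 above shows, retrieval with the next sentence is clearly better than with the previous sentence on both datasets.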

Importance of active retrieval

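This part studies the relationship between the threshold θ and performance. To move FLARE from never retrieving (θ=0) to retrieving at every sentence (θ=1), the threshold θ used to trigger retrieval is varied from 0 to 1. For each threshold, the percentage of steps/sentences where retrieval is triggered is computed, and performance is plotted against this retrieval percentage. As Figure 5 shows, on 2WikiMultihopQA performance plateaus once the retrieval percentage exceeds 60%, indicating that retrieval is unnecessary when the LM is confident. On StrategyQA, performance drops once the retrieval percentage exceeds 50%, suggesting that retrieving with high-confidence sentences can introduce noise and disrupt the original generation process. Depending on the task/dataset, triggering retrieval for roughly 40%-60% of sentences generally gives good performance.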

Effectiveness of different query formulation methods

Finally, the authors study implicit query formulation by masking and explicit query formulation by question generation. Table 4 compares FLARE's performance under different thresholds β. Retrieving directly with the complete sentence (β = 0) is worse than masking out low-probability tokens, confirming that low-confidence erroneous tokens can distract the retriever. Table 5 compares the implicit and explicit query formulation methods. The two perform similarly, indicating that both can effectively reflect the information need.

Conclusion

To aid long-form generation with retrieval augmentation, we propose an active retrieval augmented generation framework that decides when and what to retrieve during generation. We implement this framework with forward-looking active retrieval that iteratively uses the upcoming sentence to retrieve relevant information if it contains low-confidence tokens and regenerates the next sentence. Experimental results on 4 tasks/datasets demonstrate the effectiveness of our methods. Future directions include better alternatives for active retrieval and developing LM architectures for efficient active retrieval augmentation.

[NeurIPS2023] Meta-in-context learning in large language models

Wed, 10 Jan 2024 03:42:00 +0000

[pdf] [github]

Julian Coda-Forno ^1,2,∗, Marcel Binz ¹, Zeynep Akata ², Matthew Botvinick ³, Jane X. Wang ³, Eric Schulz ¹
¹ Max Planck Institute for Biological Cybernetics, ² University of Tübingen - Tübingen, Germany ³ Google DeepMind - London, United-Kingdom

Abstract

(Meta-in-context learning) The paper introduces meta-in-context learning, in which the in-context learning ability is recursively improved through in-context learning itself.
(Idealized Domain) On a regression task and a two-armed bandit task, meta-in-context learning adaptively reshapes a large language model's priors over expected tasks.
(Experiment) On real-world regression problems and several NLP tasks, it performs competitively with traditional learning algorithms.

Introduction

Through in-context learning, LLMs can solve college-level math problems or hard reasoning tasks without additional training. This ability is known as in-context learning (or few-shot prompting, or few-shot learning), and it differs from the traditional approach of finetuning on a downstream task.

In this work, the authors ask whether the learning algorithm implemented through in-context learning can be improved through in-context learning itself. The paper calls this meta-in-context learning.

Across three settings, they find evidence that the in-context learning ability improves through in-context learning. First, in artificial domains (one regression task and one two-armed bandit task), presenting the LLM with multiple learning problems sequentially improves its in-context learning. Next, in the idealized-domain experiments, they find that meta-in-context learning modifies the priors over latent variables so that they move closer to the true statistics of the environment. They additionally find that it reshapes the LLM's learning strategy itself.

The figure above is a high-level overview of meta-in-context learning. Tasks are presented one after another, so that earlier in-context learning influences later in-context learning.

Experimental Setup

GPT-3 (text-davinci-002)
temperature 0 for deterministic response

Learning one-dimensional functions

The first experimental setting is a one-dimensional regression task.

(1) Method

As in the example above, each of the 5 tasks consists of T pairs, and the goal is to predict the y value of the last pair. Every pair follows a linear function of x with additive noise $\epsilon$ on y: y = a*x + b + $\epsilon$.
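To make the setup concrete, here is a small sketch that samples such tasks and lays them out as one meta-in-context prompt. The slope and intercept priors, a ~ N(-2,1) and b ~ N(-100,1), are the ones this post reports for the experiments; the x range, the noise scale, and the exact prompt wording are assumptions for illustration only, not the paper's format.

```python
import random
from typing import List, Tuple

Pair = Tuple[float, float]


def sample_task(num_pairs: int, rng: random.Random, noise_std: float = 1.0) -> List[Pair]:
    """Sample one linear task y = a*x + b + eps with negative slope and intercept priors."""
    a = rng.gauss(-2, 1)
    b = rng.gauss(-100, 1)
    pairs = []
    for _ in range(num_pairs):
        x = rng.uniform(0, 100)  # assumed input range
        y = a * x + b + rng.gauss(0, noise_std)  # assumed noise scale
        pairs.append((x, y))
    return pairs


def build_prompt(tasks: List[List[Pair]]) -> str:
    """Concatenate tasks; only the last pair of the last task has its y left blank."""
    lines = []
    for i, pairs in enumerate(tasks, start=1):
        lines.append(f"Task {i}:")
        is_last_task = i == len(tasks)
        for j, (x, y) in enumerate(pairs):
            if is_last_task and j == len(pairs) - 1:
                lines.append(f"x={x:.2f}, y=")  # the value the model must predict
            else:
                lines.append(f"x={x:.2f}, y={y:.2f}")
    return "\n".join(lines)
```

Plain in-context learning corresponds to calling `build_prompt` with a single task, while meta-in-context learning corresponds to passing several tasks in sequence.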

(2) Results

In preliminary simulations, the authors find that GPT-3 is strongly biased toward increasing positive functions. They therefore sample a ~ N(-2,1) and b ~ N(-100,1), so that the slope and the intercept are negative. The results are as follows.

GPT-3 does in-context learning: First, without meta-in-context learning, the authors check whether GPT-3 can solve this task. They ignore the last four tasks and run the experiment on the first task only. As the blue solid line in Figure A shows, GPT-3's in-context learning outperforms Bayesian linear regression (BLR), so it can solve this task.

GPT-3 does meta-in-context learning: When tasks are presented sequentially up to the fifth task in the meta-in-context learning manner, performance improves (solid vs. dashed in Figure A). Figure B shows that performance improves gradually as the number of tasks increases. In Figure C, statistical tests on the effect of more trials within a task and of more tasks show that GPT-3 is capable of both in-context learning and meta-in-context learning.

Meta-in-context learning is driven by adaptation of priors: The authors verify that GPT-3 shifts its priors toward the true environmental statistics during meta-in-context learning. First, they set GPT-3's temperature to 1, let it sample, and repeatedly feed the samples back for further generation; it ends up producing values above 10,000. This confirms that GPT-3's regression behavior is strongly biased toward increasing positive functions. However, as Figure D shows, after seeing only about two tasks through meta-in-context learning, the bias quickly turns negative. This suggests that in-context learning improves in-context learning itself.

Meta-in-context learning is an emergent phenomenon: The smaller models in the GPT-3 family (text-ada, text-babbage, text-curie) do not show this meta-in-context learning ability. It is therefore an emergent ability that appears only at roughly the scale of text-davinci-002.

Meta-in-context learning with non-linear functions: The same trend holds for a non-linear (quadratic) function. (The paper points to the appendix, but I could not find one…)

Experiments on two-armed bandit tasks

The trend is exactly the same as in the regression task above, and there is no new takeaway. ※ See the paper for details.

Regression on real-world data and MMLU benchmark

These are results on a multi-dimensional regression benchmark containing 60 different real-world datasets. The trend again matches the artificial regression task.

In addition, MMLU is chosen as a real-world natural language processing benchmark, with the experiments focusing on the tasks in the STEM supercategory. Meta-in-context learning helps here as well; the detailed results are said to be in the appendix, which I could not find. ※ See the paper for details.

Conclusion and Discussion

Conclusion

We have demonstrated that LLMs can improve their in-context learning abilities via in-context learning itself, i.e., that they are capable of meta-in-context learning. Meta-in-context learning was not only able to overwrite an LLM’s priors but also changed its learning strategies, as demonstrated in two artificial domains. Finally, we applied our approach to two benchmarks. First, a real-world benchmark of regression tasks where we found that meta-in-context learning leads to algorithms that are competitive with standard learning algorithms. Then, we verified the applicability of our results in an NLP benchmark, providing further evidence of the versatility and effectiveness of our approach across diverse contexts.

Discussion
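The most important shortcoming of the authors' simulations is that they all rely on learning tasks with only a handful of observations. This limitation mainly comes from the practical constraint of a limited context window, combined with how quickly the prompt length grows under meta-in-context learning. They presumably had to make this design choice to run the experiments within the allowed context length (probably a budget issue). Even so, I think these simulations are enough to illustrate the potential of meta-in-context learning, and that this can lead to research on longer context lengths and lower inference cost.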

[EMNLP2023] HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

Mon, 08 Jan 2024 02:00:00 +0000

[pdf] [github]

Junyi Li ^1,3,4*, Xiaoxue Cheng ^1*, Wayne Xin Zhao ^1,4†, Jian-Yun Nie ³, and Ji-Rong Wen ^1,2,4
¹ Gaoling School of Artificial Intelligence, Renmin University of China ² School of Information, Renmin University of China ³ DIRO, Université de Montréal

Abstract

(Hallucination) Large Language Models (LLMs) such as ChatGPT hallucinate, i.e., generate content that conflicts with the source or cannot be verified against factual knowledge.
(HaluEval) To measure what types of content LLMs hallucinate and to which extent, the authors build the HaluEval benchmark, a large collection of hallucinated samples for evaluating how well LLMs recognize hallucination.
(Challenges) ChatGPT and other LLMs face great challenges in recognizing hallucination, and providing external knowledge or adding reasoning steps helps them recognize hallucinations.

1. Introduction

It is an open secret that a hallucination problem lies behind the prominent capabilities of Large Language Models. Hallucination means generating content that conflicts with the source or cannot be verified against factual knowledge. Several studies ([1],[2],[3]) have investigated the causes of hallucination in small LMs, but what types of content and to which extent LLMs tend to hallucinate remains underexplored.

To this end, the paper introduces the Hallucination Evaluation (HaluEval) benchmark. HaluEval consists of 35,000 hallucinated/normal samples: 5,000 are ChatGPT responses to general user queries, and 30,000 are task-specific samples spanning (1) question answering, (2) knowledge-grounded dialogue, and (3) text summarization.

The Figure above shows the construction pipeline.

First, for general user queries (Figure bottom), queries are drawn from Alpaca's instruction tuning dataset. To make hallucinations more likely to surface, ChatGPT generates 3 responses per query, and only the 5,000 queries whose 3 responses have the lowest similarity are kept. This builds on the recent finding from SelfCheckGPT that hallucinations are likely to appear in conflicting and diverged LLM responses. Human annotators then mark whether hallucinated information is present and, if so, the corresponding spans.

In the example in the Table above, the human annotator has marked the hallucinated span in green. These human-annotated query-response pairs make it possible to analyze what types of content LLMs hallucinate.

Next, for task-specific samples (Figure top), a two-stage approach is used. In the first step, for existing tasks (e.g., HotpotQA), ChatGPT generates hallucinated samples in a one-pass style and a conversational style. Using two styles increases the diversity of the hallucinated samples. In the second step, to pick the most plausible and difficult hallucinated sample, the filtering instruction is elaborated with ground-truth examples and ChatGPT selects the sample. With this sampling-then-filtering approach, a hallucinated counterpart can be generated for each specific task example.

Experiments on the HaluEval benchmark reveal the following three findings.

ChatGPT has a strong tendency to fabricate unverifiable information.
LLMs find it very hard to recognize hallucinations, and this includes ChatGPT, which was used to generate the samples.
LLMs' weak hallucination recognition can be improved by providing explicit knowledge and adding intermediate reasoning steps. Contrasting hallucinated samples, on the other hand, confuses the LLMs more and leads to worse performance.

2. The HaluEval Benchmark

The goal of HaluEval is to understand what types of content and to which extent LLMs tend to hallucinate, so the benchmark contains a wide variety of samples paired with hallucinated counterparts. The benchmark is collected in two ways: automatic generation and human annotation.

2.1. Automatic Generation

The automatic generation pipeline has two goals: (1) diverse hallucination sampling and (2) high-quality filtering.

(1) Diverse Hallucination Sampling.
The paper uses two hallucination sampling methods. As shown in the very first figure, the first is the one-pass method and the second is the conversational method.

one-pass

This method relies on an instruction.

As shown in Table 2 above, an instruction is used to make ChatGPT produce a hallucinated sample.

conversational

The second method makes ChatGPT generate the hallucinated answer step by step, as in a conversation.

These two methods produce diverse hallucinated samples, which are filtered later.

(2) Instruction Design.

As Table 2 above shows, instruction design matters for the one-pass instruction sampling method. The authors provide ChatGPT with three important parts: intention description, hallucination pattern, and hallucination demonstration.

First, the intention description assigns ChatGPT a role and explains the objective of the generation.
Second, the hallucination pattern controls the type and quality of the hallucinated samples.
Last, the hallucination demonstration provides few-shot exemplars.

The authors generate hallucinated samples for three tasks.

Question answering: four hallucination patterns (comprehension, factualness, specificity, inference) / HotpotQA
Knowledge-grounded dialogue: three hallucination patterns (extrinsic-soft, extrinsic-hard, extrinsic-grouped) / OpenDialKG
Text summarization: three hallucination patterns (factual, non-factual, intrinsic) / CNN/DailyMail

(3) High-quality Hallucination Filtering.

The hallucinated samples generated as above are then filtered, again using ChatGPT.

As in Table 3 above, the demonstrations include exemplars in which the ground truth is selected, whereas for the actual test example ChatGPT chooses among candidates made up only of hallucinated samples, which filters out the most plausible and difficult hallucinated sample. Such challenging hallucinated samples are hard to identify, so they are used to evaluate LLMs' hallucination recognition.

Through this sampling-then-filtering approach, 30,000 hallucinated samples are generated across the three tasks.

2.2 Human Annotation

Separately from automatic generation, human labelers are invited to annotate whether ChatGPT responses contain hallucinated content. User queries are taken from Alpaca's 52K instruction tuning dataset and fed to ChatGPT. ChatGPT generates three responses per query, and BERTScore is used to keep only the 5,000 user queries with the lowest similarity among responses. Human labelers then answer "Yes or No" and list which spans are hallucinated. The hallucination types are unverifiable, non-factual, and irrelevant. Annotators were chosen for their fluency in English, each query is labeled by three annotators, and the final label is decided by max-voting. The kappa score is 0.81, which is very high.
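The two selection steps in this paragraph can be sketched as follows. This is an illustration under assumptions: the similarity scores are taken as already computed (for example by BERTScore, which is not called here), and both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def select_most_divergent(similarity_by_query: Dict[str, float], keep: int) -> List[str]:
    """Keep the queries whose sampled responses are least similar to each other."""
    ranked = sorted(similarity_by_query, key=similarity_by_query.get)
    return ranked[:keep]


def majority_label(votes: List[str]) -> str:
    """Max-voting over annotator labels; three binary votes always have a majority."""
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Example usage with made-up similarity scores and votes.
kept = select_most_divergent({"q1": 0.9, "q2": 0.3, "q3": 0.6}, keep=2)
label = majority_label(["Yes", "No", "Yes"])
```

Low response similarity is used only as a proxy for "likely to contain hallucination", so the human label still decides the final annotation.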

An example of human annotation is shown in Table 4 below.

2.3 Benchmark Analysis and Usage

In the human annotation of ChatGPT responses, 977 responses (19.5%) contained hallucinations.

Figures 2 and 3 above show the topic distributions for automatic sampling and human annotation.

Researchers can use this benchmark in three ways.

analyzing what types of content LLMs tend to generate
evaluating the ability of LLMs to recognize hallucinations in the generated samples
assessing whether the LLMs’ output contains hallucinations

3. Experiments

3.1 Experimental Setup

closed-source LLMs

GPT-3 (davinci)
InstructGPT (text-davinci-002/003)
ChatGPT (gpt-3.5-turbo)
Claude
Claude2

open-source LLMs

Alpaca (7B)
Vicuna (7B)
ChatGLM (7B)
Falcon (7B)
Llama-2-chat (7B)

3.2. Results and Analysis

Hallucination Recognition

LLMs are still poor at identifying hallucination.

For hallucination detection on the summarization task, even the state-of-the-art ChatGPT reaches only 58.53%, and many LLMs perform worse than the 50% random-guess baseline.

On all three tasks, failures occur on the hallucination pattern (P-I) that is factually correct but conflicts with the context.

Improvement Strategies

Adding knowledge retrieval and CoT yields improvements, whereas sample contrast makes performance worse.

Case Study

Knowledge retrieval helps resolve hallucination.

4. Conclusion

We introduce HaluEval, a large-scale collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognizing hallucinations. To automatically generate large-scale samples, we propose a two-step approach, i.e., sampling-then-filtering. We first introduce two different sampling methods to generate diverse samples using instructions and then filter and select the difficult one. Besides, we invite qualified human labelers to annotate the hallucinations of ChatGPT responses given user queries. We find that, existing LLMs mostly fail to recognize the hallucinations in text and tend to generate hallucinated content. Finally, we suggest several strategies to help LLMs recognize hallucinations. Our benchmark can facilitate research in understanding what types of content and to which extent LLMs tend to hallucinate, ultimately paving the way for building more effective and reliable LLMs in the future.

[ICML2023] QASA: Advanced Question Answering on Scientific Articles

Tue, 02 Jan 2024 00:18:00 +0000

[pdf] [github]

Yoonjoo Lee ^1*, Kyungjae Lee ^2*, Sunghyun Park ², Dasol Hwang ², Jaehyeon Kim ², Hong-in Lee ³, Moontae Lee ^2,4
¹ KAIST (Work done at LG AI Research) ² LG AI Research ³ Yonsei University ⁴ University of Illinois Chicago. Correspondence to: Moontae Lee moontae.lee@lgresearch.ai.

Abstract

(Motivation) Reasoning is an indispensable part of intellectual thinking, and Question Answering (QA) is one way to exercise it. However, most current QA only involves shallow questions that need no deeper understanding, or short factoid questions.
(Associative Thinking) Multiple studies indicate that humans use associative thinking to collect relevant pieces of knowledge and then ground them.
(QASA) The authors propose QASA, a full-stack reasoning dataset of 1,798 QA pairs on scientific articles in the AI/ML field, with three question types: surface, testing, and deep questions.
(Experimental Results) An LLM trained using QASA outperforms InstructGPT by a big margin.

Introduction

Cognitive science research dating back to 1974 holds that humans reason through a dual process. The first step is associative thinking and the next is logical reasoning. In the context of QA, the first corresponds to collecting knowledge pieces through lexical matching and the like, and the second to finding the evidential rationale needed to answer.

Reading Comprehension (RC) is a reasoning task that embodies various forms of QA. Tasks such as SQuAD, NewsQA, DROP, and Natural Questions have been proposed. They have certainly played a large role in advancing model performance, but most of their questions are short factoid ones of the "what", "when", "where", "who" kind, and "how" and "why" questions are almost absent.

Recent open-domain QA follows a two-stage retrieve-then-read approach that retrieves relevant documents and then derives the answer. However, it is again mostly limited to short factoid QA, and in most cases it relies on the first stage rather than using both stages jointly.

The authors' think-aloud study reveals that full-stack reasoning over a scientific article involves testing and deep questions in addition to surface questions. In particular, it shows that answering surface questions requires both first- and second-stage reasoning. To address this, the authors propose the Question Answering on Scientific Articles (QASA) benchmark. In building the dataset, readers and authors generate questions after reading the whole paper, not just isolated paragraphs. In addition, they answer with multi-faceted long-form answers.

An example from QASA is shown in the figure above. QASA contains 1,798 QA pairs from AI/ML papers, and with the question schema above, about 39.4% of them are deep reasoning level questions.

Experiments cover three subtasks: associative selection and evidential rationale-generation, which evaluate the two stages above separately, and systematic composition, which checks whether both stages work well together. When each subtask is modeled with a pretrained LLM, the resulting system beats InstructGPT (text-davinci-003) by 5.11 points in ROUGE-1.

QASPER is a benchmark for QA on academic research papers. Its question annotators wrote questions after reading only the title and abstract, so it consists of shallow questions, and about 70% of the questions have simple answers such as yes/no or a small extractive word span.
ELI5 and ASQA are open-domain long-form QA benchmarks. ELI5 is a Reddit-based dataset in which most questions come with a supporting paragraph, so it does not require associative selection, i.e., gathering pieces of knowledge. ASQA requires answering all sub-questions scattered over multiple passages. However, neither involves associative selection, whereas QASA requires evidential rationale generation rather than simply answering sub-questions.
AQuaMuSe is a benchmark for Query-focused Multi-Document Summarization (qMDS). qMDS is similar in that it extracts information from multiple documents for summarization, but it has to use passages automatically generated through lexical matching (because there is no annotation), whereas QASA has human-annotated evidence aligned to particular paragraphs.

Proposed Task

The authors propose a new task: QA grounded in scientific articles. It is a challenging task in which a question must be answered based on multiple pieces of evidence spread across a long research paper. Given a question Q, an answer A, and the set of paragraphs P, one option is to process all paragraphs at once with a model such as Longformer. However, the QASA task requires the ability to connect each question with rationales from the paper. The authors therefore design the problem in three steps: (1) associative selection, (2) evidential rationale-generation, and (3) systematic composition.

Associative Selection
Given paragraphs $P=(p_1, \dots, p_N)$, this sub-task extracts $\hat{P}=(\hat{p_1}, \dots, \hat{p_k})$, where $k \ll N$, containing the answer or rationales. Whereas conventional answerability classification only checks whether each paragraph contains the answer, the QASA task also checks whether it contains any of the multiple rationales, including the main answer. It can therefore be seen as a super-task of conventional answerability classification.

Evidential Rationale-Generation
This step generates, from the selected paragraphs, the evidential rationales that form the basis of the long-form answer. An evidential rationale can be (1) the main answer, (2) elaboration (i.e., sentences which elaborate on the main answer), or (3) auxiliary information (i.e., background knowledge that could be helpful to the user). A rationale set $(e_1, e_2, \dots, e_k)$ is produced from $\hat{P}=(\hat{p_1}, \dots, \hat{p_k})$.

Systematic Composition
The evidential rationale set $(e_1, e_2, \dots, e_k)$ is treated as a single context from which the answer a is composed.

Building the QASA Dataset

(1) Question types
Reflecting on the types of questions raised while reading a paper, the authors use a question schema covering different levels of reasoning. The question types are as follows.

Surface questions aim to verify and understand basic concepts in the content. The answer content is directly related to the words in the question and immediate context. This type includes verification, distinctive, concept completion questions.
Testing questions are focused on meaning-making and forming alignment with readers’ prior knowledge. These questions aim to find similar examples (example), quantify variables (quantification), and find meaning and make comparisons across concepts (comparison).
Deep questions ask about the connections among the concepts in the content and elicit advanced reasoning in logical, causal, or goal-oriented systems. This type includes causal antecedent, causal consequence, goal orientation, instrumental/procedural, rationale, expectation questions.

(2) Papers
The authors use S2ORC, a collection of machine-readable full text of open-access papers, and the arXiv paper collection. From arXiv they use the cs.AI domain, and from S2ORC they use only papers published since 2015 with at least 100 citations.

(3) Data collection
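Data collection is split into a reader session and an author session: in the reader session, general readers produce QA pairs, and in the author session, authors produce optimally annotated questions. For both sessions, annotators working in AI/CS were recruited, and they reportedly also took an exam to ensure answer quality. ※ See the paper for the detailed question/answer generation procedure.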

(4) QASA Analysis
Representative examples

Of the three question types, 39.4\% are deep questions, 30.0\% are testing, and 30.7\% are surface-level. Among deep questions the instrumental sub-type is the most frequent, among testing questions it is comparison, and among surface questions it is concept completion.

About 12\% of the questions are unanswerable questions with no rationale, and answerable questions have 1.67 evidential rationales on average. About half of the answers require the annotator to compose the rationales, and the other half only require simplifying redundant rationales.

QASA Approach

Experiments are run on the three sub-tasks described earlier: (1) associative selection, (2) evidential rationale-generation, and (3) systematic composition. For associative selection, to narrow down the search space, a pre-trained retrieval model picks the top-10 paragraphs for the question, which stand in for the whole paper.

An LLM is instruction-tuned on the three sub-tasks, which run sequentially so that the output of one step becomes the input of the next. The instruction prompts are as follows.

The models used are T5, T0, FLAN-T5, and GALACTICA.

The training data is as follows.

Experiment

Evaluation of Subtasks and Full-stack QA
For associative selection, human-annotated paragraphs are treated as positives and the remaining top-10 paragraphs as negatives, and the classification is evaluated with Precision, Recall, and F1 score. For rationale-generation, the evidential rationales generated from the gold positive paragraphs are evaluated with ROUGE. For answer composition, the answer generated from the gold evidential rationale list is likewise evaluated with ROUGE.
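For reference, the paragraph-selection metrics can be computed as in the sketch below. It assumes paragraphs are identified by integer indices and uses the standard definitions of precision, recall, and F1; the function name is hypothetical.

```python
from typing import Set, Tuple


def precision_recall_f1(predicted: Set[int], gold: Set[int]) -> Tuple[float, float, float]:
    """Score selected paragraph ids against human-annotated positive paragraph ids."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Example: 4 of the top-10 paragraphs were selected, 2 of them are annotated positives.
p, r, f1 = precision_recall_f1(predicted={1, 2, 3, 4}, gold={2, 4, 7})
```

In this example precision is 2/4, recall is 2/3, and F1 is their harmonic mean, 4/7.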

Main Results
Performance on the three sub-tasks and on full-stack QA is shown below.

Which pretrained LM is best?
Among pretrained LMs, InstructGPT (175B) was the best, with the best performance on the rationale-generation task in particular. Among T5-based LMs, the ranking was FLAN-T5 > T0 > T5, showing that the number of downstream tasks learned has a significant impact.

Which finetuned LM is best?
T0, T5, and FLAN-T5 performed similarly on the three subtasks, but FLAN-T5 stands out on full-stack QA. It even performs far better than InstructGPT.

Does our task indeed need rationale-generation?
In the table below (full-stack QA results), performance drops sharply without rationale generation (w/o Rationale Gen), showing that rationale generation is a crucial step for full-stack QA.

The failure of Galactica
Although Galactica was trained on large-scale research papers, it performs poorly. In particular, it tends to answer with only 'yes' or 'no', which results in very low ROUGE scores.

Human Evaluation

Human evaluation follows the ASQA paper. Solving QA with QASA's full-stack approach was better in Groundedness, Completeness, and Specificity than simply giving InstructGPT the question and taking its answer. In Fluency, on the other hand, InstructGPT was better.

Conclusion

Conventional information search requires a series of nontrivial efforts from retrieving and reranking relevant information to manually reading and restructuring the selected information. Due to growing volumes of scientific papers and professional articles, the traditional process is no longer feasible, urging an innovation in knowledge processing and reasoning. Generative QA would be a promising alternative, but it lacks appropriate benchmark and principled methodologies that are focused on human intellectual capabilities: full-stack reasoning.

In this paper, we propose the QASA: a novel benchmark dataset and a computational approach. Our QASA benchmark guides expert readers and paper authors to generate various types of questions and answers from surface to testing and deep levels. Our QASA approach decomposes the full-stack reasoning process into three reasoning subtasks: associative selection, evidential rationale-generation, and systematic composition. By modeling each subtask by pretrained LM, we show that FLAN-T5 finetuned on public and synthetic data could serve as the best test-bed for our QASA, proposing a new horizon of full-stack cognitive reasoning on scientific articles such as research papers and manuscripts.

A Survey of Large Language Models (4)

Sun, 24 Dec 2023 02:45:00 +0000

[pdf] [github]

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie and Ji-Rong Wen

Continued from A Survey of Large Language Models (2)…

A Survey of Large Language Models (3)

Sun, 24 Dec 2023 02:45:00 +0000

[pdf] [github]

Continued from A Survey of Large Language Models (2)…

Continued in A Survey of Large Language Models (4)…

A Survey of Large Language Models (2)

Sun, 24 Dec 2023 02:45:00 +0000

[pdf] [github]

Continued from A Survey of Large Language Models (1)…

4. Pre-training

Pre-training an LLM depends on efficient algorithms, model architecture, optimization techniques, and more. This section looks at three elements of LLM pre-training in turn: (1) data collection, (2) model architecture, and (3) training techniques.

4.1. Data Collection and Preparation

Securing a high-quality dataset is crucial for training an LLM. This section covers three aspects: data sources, preprocessing methods, and how pre-training data affects LLMs.

(1) Data Source

As in the figure above, most LLMs use a mixture of several data sources as the pre-training dataset. The datasets fall into two broad groups: general text data and specialized text data. General data, used by most LLMs, includes webpages, books, and conversational text; because it is large, diverse, and easy to access, it is used to improve generalization ability. Specialized data, such as multilingual data, scientific data, and code, is used to give the model specific task-solving capabilities.

※ See the paper for a detailed description of each dataset.

(2) Data Preprocessing

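Once the pre-training corpus is prepared, preprocessing that removes noisy, redundant, irrelevant, and toxic data is essential. Data-Juicer, a preprocessing tool that bundles many preprocessing methods, was released recently. A typical preprocessing pipeline is shown in the figure above.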

Quality Filtering

Quality filtering generally uses either classifier-based or heuristic-based methods. The classifier-based methods that used to be common tend to remove dialectal or colloquial text, which increases bias. Recent models such as BLOOM and Gopher therefore use heuristics, including language-based filtering, metric-based filtering, statistic-based filtering, and keyword-based filtering.

De-duplication

A recent study argued that duplication, i.e., repeated text, makes training unstable and hurts performance. Accordingly, low-quality sentences with repeated words are removed (sentence-level), documents that overlap too much, e.g., in terms of n-grams, are removed (document-level), and the overlap between the training set and the evaluation set is removed to address dataset contamination (set-level).
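A minimal sketch of document-level de-duplication by n-gram overlap. The Jaccard measure over word trigrams, the 0.8 threshold, and the greedy keep-first strategy are assumptions chosen for illustration; production pipelines typically use scalable approximations rather than this quadratic comparison.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams of a document."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard overlap; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def deduplicate(docs: List[str], threshold: float = 0.8, n: int = 3) -> List[str]:
    """Greedily keep a document only if it is not too similar to any kept one."""
    kept: List[str] = []
    kept_grams: List[Set[Tuple[str, ...]]] = []
    for doc in docs:
        grams = ngrams(doc, n)
        if any(jaccard(grams, seen) >= threshold for seen in kept_grams):
            continue  # near-duplicate of a document we already kept
        kept.append(doc)
        kept_grams.append(grams)
    return kept
```

The same overlap check, applied between training and evaluation documents, gives a simple set-level contamination filter.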

Privacy Reduction

Personally identifiable information, commonly called PII, must be removed from the pre-training corpus. One approach is rule-based removal of names, addresses, phone numbers, and so on.
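A toy example of the rule-based approach, using regular expressions for phone numbers and, as an extra illustration not mentioned above, email addresses. Both patterns and the placeholder strings are assumptions and are deliberately coarse: the phone pattern, for instance, can also match long date-time strings, which is the usual trade-off of rule-based scrubbing.

```python
import re

# Illustrative patterns only; real PII filters are considerably more careful.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text


# Example usage on a made-up sentence.
cleaned = scrub_pii("Mail jane.doe@example.com or call +1 (555) 123-4567.")
```

Names and addresses are harder to capture with patterns alone, so they usually need dictionaries or a named-entity model on top of rules like these.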

Tokenization

The last step is tokenization. Subword-level tokenization is now the norm, with byte pair encoding (BPE), WordPiece tokenization, and unigram tokenization as the main choices. BPE, which starts from basic symbols and repeatedly merges the most frequent adjacent pair, works well in multilingual settings and is used by GPT-2, BART, and LLaMA. WordPiece is Google's subword tokenization algorithm; it was originally devised for a voice search system and was later used in MT models and in BERT. WordPiece is basically similar to BPE but differs slightly in how it chooses which pair to merge. Finally, unigram tokenization is an EM-style algorithm: starting from a large vocabulary, it uses the current (old) LM to remove tokens step by step, then re-estimates the vocabulary, and repeats. It is used in T5 and mBART.

Reusing an existing tokenizer is a reasonable option, as OPT and GPT-3 did with the GPT-2 tokenizer, but a tokenizer designed specifically for the pre-training corpus can help considerably. Recent LLMs therefore tend to train customized tokenizers, for example with the SentencePiece library, which supports both BPE and unigram tokenization. Customized tokenizers call for care in transfer learning, however. LLaMA's BPE tokenizer was trained on a mostly English corpus, so fine-tuning on non-English data can be difficult.
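As a rough illustration of how BPE builds its vocabulary, the sketch below learns merge rules from a toy corpus by repeatedly merging the most frequent adjacent symbol pair. It is a character-level simplification with hypothetical function names; production tokenizers (byte-level BPE, SentencePiece) add pre-tokenization, byte fallback, and many other details not shown here.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Return the list of merge rules learned from a whitespace-split corpus."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges


print(learn_bpe("low low low lower lowest", 2))
```

On this toy corpus the first two merges build the shared stem "low", which is then reused by "lower" and "lowest".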

(3) Data Scheduling

Two things matter in data scheduling: data mixture and data curriculum.

Data mixture

When mixing data, the proportions matter. Upsampling and downsampling are the usual tools. Several recent studies have shown that training too heavily on a single domain leads to poor performance. Some studies also propose optimizing the proportions with a model instead of setting them heuristically. A simple example is to raise the share of the pre-training data that matches a given downstream task, although this is not very practical.

Data curriculum

Several studies ([1],[2]) have shown that learning basic skills before the target skill is effective. The order in which the pre-training data is presented, that is, the curriculum, therefore also matters. Curricula are most often applied to three target skills: coding, mathematics, and long-context modeling.

4.2. Architecture

This section looks at LLM architecture design: mainstream architectures, pre-training objectives, and detailed configurations.

(1) Typical Architectures

The Transformer is the de facto backbone of LLMs. Architectures are usually grouped into three major types: encoder-decoder, causal decoder, and prefix decoder.

Encoder-decoder Architecture : T5, BART, etc.
Causal Decoder Architecture : the GPT series, OPT, BLOOM, Gopher, and most other LLMs
Prefix Decoder Architecture : U-PaLM, GLM-130B, etc.

(2) Detailed Configuration

The paper covers four configuration choices of the Transformer that underlies most LLMs: (1) normalization method, (2) normalization position, (3) activation functions, and (4) position embeddings.

For the attention mechanism, it additionally covers (1) full attention, (2) sparse attention, (3) multi-query/grouped-query attention, (4) FlashAttention, and (5) PagedAttention.

※ See the paper for details on each configuration and method.

(3) Pre-training Tasks
Most LLMs are trained with language modeling and denoising autoencoding. ※ These are well known, so they are omitted here; see the paper.

(4) Long Context Modeling
Demand is growing for longer context windows in applications such as PDF processing and story writing. GPT-4 Turbo supports a 128K context window, and Claude 2.1 (Anthropic) supports 200K. Two families of techniques are commonly used to improve long-context modeling.

Scaling Position Embeddings

Position embeddings such as T5 Bias, ALiBi, xPos, and NoPE mostly generalize beyond the maximum training length after training only on sequences within it. This is called extrapolation capability. Rotary Position Embedding (RoPE), one of the mainstream position embeddings, shows only limited extrapolation. The methods below scale RoPE to longer texts.

1) Direct model fine-tuning : Simply fine-tune the LLM on longer texts, usually with a multi-stage approach (e.g. 2K -> 8K -> 32K). The drawback is that it is very slow.

2) Position interpolation : Downscale the position indices of the long context so that they fit inside the original context window. The position indices are simply multiplied by L/L' (original context length L, target context length L'). Experiments show that this extends the context window effectively and efficiently, but it can have an adverse impact on short texts.

3) Position truncation : To avoid out-of-distribution rotation angles, truncate the longer relative positions of a long context. ReRoPE and LeakyReRoPE introduce a pre-defined window length: relative positions inside the window are kept, and those outside it are either truncated to the window length or interpolated up to the maximum context length. This preserves local position relationships while gaining extrapolation capacity. The drawback is extra cost, since the attention matrix is computed twice.

4) Base modification : With the maximum training length already fixed (e.g. 4096 in LLaMA 2), decreasing the basis angles $\theta$, typically by increasing the RoPE base, makes longer texts processable.
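Position interpolation and base modification can both be read as small changes to the RoPE angle formula. The sketch below assumes the common parameterization in which the angle for dimension pair i at position m is m · base^(-2i/d); the function name, the toy dimension of 8, and the example lengths are illustrative assumptions, not values from the survey.

```python
def rope_angles(position, dim=8, base=10000.0, scale=1.0):
    """Rotation angles for one position: (position * scale) * base^(-2i/dim)."""
    return [position * scale * base ** (-2 * i / dim) for i in range(dim // 2)]


L_train, L_target = 2048, 8192

# Position interpolation: multiply position indices by L / L' so that the
# last target position lands inside the range seen during training.
interpolated = rope_angles(L_target - 1, scale=L_train / L_target)
print(interpolated[0] < L_train)  # fastest-rotating pair stays in the trained range

# Base modification: a larger base gives smaller basis angles theta_i,
# so every pair rotates more slowly at the same position.
original = rope_angles(L_target - 1, base=10000.0)
larger_base = rope_angles(L_target - 1, base=500000.0)
print(all(a <= b for a, b in zip(larger_base, original)))
```

The two knobs are independent, which is why they are sometimes combined in practice.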

Adapting Context Windows

Because an LLM is trained with a fixed context window, it struggles with longer sequences. The methods below were devised to get around this limit.

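1) Parallel context window : As in Fusion-in-Decoder (FiD), a divide-and-conquer strategy is applied to the input text: it is split into segments that are processed separately. Because such methods cannot tell the different segments apart, for example their order, performance is limited.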

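2) Λ-shaped context window : Recent studies observe that LLMs allocate larger attention weights to the starting and the most recent tokens, which is related to the “lost in the middle” phenomenon. Based on this finding, LM-Infinite and StreamingLLM apply a “Λ-shaped” attention mask that keeps only the tokens within a chosen scope and discards the rest. This extends well to long contexts, but it has trouble modeling long-range dependencies, and performance suffers.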

3) External memory : Based on the finding that most attention patterns in a Transformer are captured by a small subset of tokens, past keys are stored in an external memory, and a k-NN search retrieves the k most relevant tokens for generation.

(5) Decoding Strategy
After training, a decoding strategy has to be chosen for effective generation. Options include greedy search, beam search (+ length penalty), random sampling, top-k sampling, and top-p (nucleus) sampling. LLM decoding is also inefficient because of issues such as the memory wall, and methods such as Flash-Decoding have been devised to address this.
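The difference between top-k and top-p sampling is easiest to see on a toy distribution. The sketch below filters a next-token distribution both ways and then samples from the result; the probabilities and function names are made up for illustration, and a real decoder would operate on logits over the full vocabulary.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(dist, rng=random):
    """Draw one token id from a {token_id: probability} mapping."""
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
print(sorted(top_k_filter(probs, 2)))    # token ids kept by top-k with k=2
print(sorted(top_p_filter(probs, 0.9)))  # token ids kept by top-p with p=0.9
print(sample(top_p_filter(probs, 0.9)))
```

Top-k keeps a fixed number of candidates, while top-p adapts the candidate set to how peaked the distribution is.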

4.3. Model Training

This part covers the important settings and tricks for training LLMs.

(1) Optimization Setting

Batch Training

For training stability and throughput, the batch size is set fairly large (2,048 examples or 4M tokens). GPT-3 and PaLM introduced a technique that increases the batch size dynamically during training; in GPT-3, it grows from 32K tokens to 3.2M. Empirical results indicate that such a dynamic schedule stabilizes LLM training.

Optimizer

Adam and AdamW are widely used for LLM training, typically with hyper-parameters $\beta_1=0.9, \beta_2=0.95, \epsilon=10^{-8}$. T5 and PaLM used Adafactor.

Stabilizing the Training

LLM training is prone to instability issues that can end in model collapse. Gradient clipping and weight decay have long been used for stability, but training loss spikes are still frequent in LLMs. PaLM and OPT take the simple approach of restarting from a checkpoint just before the spike and skipping the data that may have caused it. GLM instead shrinks the abnormal gradients that cause spikes.

(2) Scalable Training Techniques

Two major issues arise when training LLMs: training throughput has to be increased, and the model is too large to load into the memory of a single GPU. The techniques below address them.

3D parallelism

3D parallelism combines three common parallelization schemes: data parallelism, pipeline parallelism, and tensor parallelism. Data parallelism is standard, so it is skipped here. Pipeline parallelism places consecutive layers on different GPUs. GPUs then have to wait for one another's computation, which causes bubble overhead, and techniques such as GPipe and PipeDream were developed to reduce it. Tensor parallelism splits a parameter matrix into submatrices and places them on different GPUs. It is available in open-source libraries such as Megatron-LM.

In practice these techniques are applied jointly. BLOOM, for example, was trained on 384 A100 GPUs with 8-way data parallelism, 4-way tensor parallelism, and 12-way pipeline parallelism (8 × 4 × 12 = 384).

ZeRO

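ZeRO, from the DeepSpeed library, targets the memory redundancy of data parallelism: instead of every GPU holding a full copy of the model states (parameters, gradients, and optimizer states), each GPU keeps only a partition and retrieves the rest when needed. PyTorch implements a similar technique called FSDP.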

Mixed Precision Training

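32-bit floating-point computation is reduced to 16 bits (FP16), or even to 8 bits (FP8). Naive low-precision training can degrade model performance, however, so Brain Floating Point (BF16) was developed; it allocates more exponent bits and fewer significand bits than FP16 and has shown better pre-training performance than FP16.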
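The trade-off between FP16 and BF16 can be checked with the standard library alone. The sketch below emulates BF16 by keeping the top 16 bits of a float32 (truncation, which is a simplification; hardware normally rounds to nearest) and uses the half-precision `e` format of `struct` for FP16. The helper names are hypothetical.

```python
import struct


def to_bf16(x):
    """Truncate a float32 to bfloat16 precision by keeping its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def to_fp16(x):
    """Round-trip through IEEE half precision; return inf if x is out of range."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf")


# Range: 70000 exceeds the FP16 maximum (65504) but fits comfortably in BF16.
print(to_fp16(70000.0), to_bf16(70000.0))

# Precision: 1 + 2^-10 is exact in FP16 (10 fraction bits) but not in BF16 (7 fraction bits).
print(to_fp16(1.0009765625), to_bf16(1.0009765625))
```

BF16 trades fraction bits for the same 8-bit exponent as float32, which is why it is less prone to overflow during training.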

5. Adaptation of LLMs

Pre-training alone already gives LLMs impressive performance and strong generalization. Their abilities can nevertheless be adapted further toward specific goals, and such adaptation typically also aims to align the model with human values or preferences. The two main approaches covered here are (1) instruction tuning and (2) alignment tuning, followed by adaptation methods that are efficient in terms of parameters and memory.

5.1. Instruction Tuning

Instruction tuning fine-tunes an LLM on a collection of instances formatted in natural language. It is closely related to supervised fine-tuning (SFT) and multi-task prompted training. Instruction tuning greatly improves generalization to unseen tasks and is also effective in multilingual settings. A recent study gives a systematic overview of instruction tuning and is worth a look for interested readers.

5.1.1. Formatted Instance Construction

Instruction tuning first requires collecting instruction-formatted instances. An instruction-formatted instance consists of a task description (called an instruction), an optional input, and the corresponding output. Section 3.3, covered in the previous post, lists available instance resources for instruction tuning. Here, three ways of constructing formatted instances are discussed.

(1) Formatting NLP Task Datasets
The first kind is collected from diverse NLP tasks such as text summarization, text classification, and translation. The collected datasets are formatted with natural-language task descriptions (usually human-written) and trained in a multi-task fashion. In (a) of the figure above, a QA task is solved together with the human-written instruction “Please answer this question”.

The instruction (task description) plays a key role in this fine-tuning. Even on the same tasks, training without instructions causes a dramatic drop in generalization. Crowdsourcing platforms such as PromptSource have been proposed to help write good instructions.

(2) Formatting Daily Chat Data
NLP training instances are abundant, but they tend to mismatch real-world scenarios. To address this, InstructGPT took queries submitted by real users to the OpenAI API and had human labelers write the answers. Each collected pair of user query and human-written answer forms one instance of the training dataset.

(3) Formatting Synthetic Data
Synthetic data generated by LLMs is also used for instruction tuning. The Self-Instruct method starts from roughly 175 seed instances and can generate a large number of instances from them. The key step is filtering for quality and diversity; because the dataset is machine-generated, this filtering is indispensable.

Self-Instruct data can still be simplistic or lack diversity, however. To address this, WizardLM raises complexity and diversity through in-depth and in-breadth evolving, and Self-Align uses multiple human-aligned principles as filtering criteria.

5.1.2. Instruction Tuning Strategies

Instruction tuning can be trained far more efficiently than pre-training. Since it is a supervised setting, it may look much like pre-training, but it differs in several ways: it typically uses a sequence-to-sequence loss (whereas pre-training uses the LM loss) and a smaller batch size and learning rate. Four further points matter.

(1) Balancing the Data Distribution
Instruction tuning usually mixes many tasks, so balancing their proportions is important. A widely used method is the examples-proportional mixing strategy: combine all datasets and sample every instance with equal probability. Raising the sampling ratio of high-quality collections such as FLAN and P3 is also worth considering. In either case, it is common to set a maximum cap on the number of examples a dataset can contribute, so that larger datasets do not dominate the mixture.
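A small sketch of examples-proportional mixing with a maximum cap follows. The dataset names and sizes are made up, and `mixing_rates` is a hypothetical helper; the point is only to show how the cap limits the share of a very large dataset.

```python
def mixing_rates(dataset_sizes, cap=None):
    """Sampling rate per dataset, proportional to its (optionally capped) size."""
    capped = {
        name: min(n, cap) if cap is not None else n
        for name, n in dataset_sizes.items()
    }
    total = sum(capped.values())
    return {name: n / total for name, n in capped.items()}


# Hypothetical sizes, for illustration only.
sizes = {"nlp_tasks": 1_000_000, "synthetic": 200_000, "daily_chat": 50_000}

print(mixing_rates(sizes))               # the largest dataset dominates
print(mixing_rates(sizes, cap=100_000))  # the cap evens out the mixture
```

Without the cap the largest dataset takes 80% of the samples; with the cap it takes 40%.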

(2) Combining Instruction Tuning and Pre-Training
For training stability, OPT-IML includes pre-training data during instruction tuning, which can act as a regularizer for model tuning. Along the same lines, some studies do not separate pre-training from instruction tuning and instead train on a mixture of pre-training and instruction-tuning data in a multi-task manner. GLM-130B and Galactica likewise include instruction-formatted data as a small part of their pre-training corpora and report good results.

(3) Multi-stage Instruction Tuning
NLP instruction datasets are far larger than daily chat datasets. Besides mixing the two kinds carefully, a multi-stage instruction tuning strategy can be considered: first train on the large NLP instruction datasets, then on the daily chat data. To prevent capacity forgetting, it also helps to include some NLP instances in the second stage.

(4) Efficient training for multi-turn chat data
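A straightforward way to train on a multi-turn conversation is to split it into multiple context-response pairs, but this processes the overlapping utterances repeatedly. Vicuna instead feeds the whole conversation into the LLM once and uses a loss mask so that the loss is computed only on the chatbot's responses. This can reduce the compute cost significantly.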
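The loss-mask idea can be sketched without any ML framework. Below, a conversation is flattened into one token sequence with a 0/1 mask, and the loss is averaged over assistant tokens only. The whitespace "tokenizer", role names, and function names are assumptions for illustration and do not reproduce Vicuna's actual code.

```python
def build_loss_mask(turns):
    """Flatten a conversation into tokens and a 0/1 mask (1 = assistant token)."""
    tokens, mask = [], []
    for role, text in turns:
        piece = text.split()  # whitespace "tokenizer", for illustration only
        tokens.extend(piece)
        mask.extend([1 if role == "assistant" else 0] * len(piece))
    return tokens, mask


def masked_mean_loss(token_losses, mask):
    """Average per-token loss over assistant tokens only."""
    selected = [loss for loss, m in zip(token_losses, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


conversation = [
    ("user", "hi there"),
    ("assistant", "hello"),
    ("user", "how are you"),
    ("assistant", "fine thanks"),
]
tokens, mask = build_loss_mask(conversation)
print(tokens)
print(mask)

# Pretend per-token losses from a single forward pass over the whole conversation.
losses = [2.0, 2.0, 1.0, 2.0, 2.0, 2.0, 0.5, 1.5]
print(masked_mean_loss(losses, mask))
```

One forward pass covers every assistant turn, whereas the split-into-pairs approach would re-encode the earlier turns once per response.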

5.1.3. The Effect of Instruction Tuning

Instruction tuning has three main effects.

(1) Performance Improvement
A smaller model with instruction tuning can outperform a larger model without it. Instruction tuning is thus a much cheaper way to improve performance than pre-training.

(2) Task Generalization
Compared with pre-training alone, instruction tuning improves generalization to unseen tasks (this is the reason for doing it in the first place). It also alleviates chronic weaknesses of LLMs, such as repetitive generation or complementing the input without accomplishing the task. It is especially important for extending a model to multilingual settings.

(3) Domain Specialization
In expert domains such as medicine, law, and finance, a model trained only on general pre-training data performs poorly. Because pre-training datasets mostly cover general text, the model needs to learn from domain-specific data, and this can be done through instruction tuning.

5.1.4. Empirical Analysis for Instruction Tuning

The points above are now checked experimentally.

(1) Instruction Dataset
For the three kinds of instruction data described above, the experiments use the FLAN-T5 dataset for task-specific instructions, the ShareGPT dataset for daily chat instructions, and Self-Instruct-52K for synthetic instructions. Since FLAN-T5 is very large, it is limited to 80,000 samples.

(2) Improvement strategies
Human-written instructions are ideal but hard to obtain at scale, so instructions can instead be synthesized with LLMs. Such instructions, however, are often too simple or too difficult to be useful. The following four strategies, used in the experiments, aim to improve the quality of synthetic instructions.

Enhancing the instruction complexity : gradually raise the complexity level, as in WizardLM.
Increasing the topic diversity : increase the diversity of topics in the instructions; synthetic instances are rewritten with ChatGPT.
Scaling the instruction number : increase the number of instruction instances.
Balancing the instruction difficulty : measure difficulty with the perplexity score of LLaMA-7B and balance accordingly.

(3) Results and Analysis

For LLaMA 7B and 13B respectively, the upper rows show the effect of mixing instruction datasets, and the lower rows show the effect of the improvement strategies. A closer look at the results gives the following findings.

Task-formatted instructions are more proper for the QA setting, but may not be useful for the chat setting.
A mixture of different kinds of instructions are helpful to improve the comprehensive abilities of LLMs.
Enhancing the complexity and diversity of instructions leads to an improved model performance.
Simply increasing the number of instructions may not be that useful, and balancing the difficulty is not always helpful.
A larger model scale leads to a better instruction following performance

5.1.5. Instruction Tuning Suggestions

The basic specifications for instruction-tuning LLMs are listed in the table above. For a first attempt at instruction tuning, following the code in the Alpaca repository is recommended. If computational resources are limited, LoRA can be used for parameter-efficient tuning.

5.2. Alignment Tuning

5.2.1. Background and Criteria for Alignment

(1) Background
LLMs perform very well across a wide range of tasks but also show side effects: fabricating false information (hallucination) and producing harmful, misleading, or biased expressions. Because LLMs are trained with the language modeling objective of next-word prediction, human values and preferences are hard to reflect. Alignment tuning addresses this and, unlike pre-training or instruction tuning, has to consider quite different criteria (e.g. helpfulness, honesty, and harmlessness).

(2) Alignment Criteria
Many criteria are possible for alignment tuning; the paper focuses on the 3H values mentioned above (also used in InstructGPT): helpfulness, honesty, and harmlessness.

Helpfulness

Helpfulness means assisting users in solving their tasks in line with their intent, including further clarification when it is needed. Because helpful behavior is hard to define precisely, it is one of the harder criteria to achieve.

Honesty

When uncertainty is high, the model should say that it does not know instead of giving a fabricated answer (“know unknowns”). According to one study, honesty is a more objective criterion than the other two and can therefore be learned with less reliance on human effort.

Harmlessness

This criterion requires that the model not generate offensive text. When asked to carry out a dangerous action, the LLM should politely refuse.

All of these criteria are subjective, which makes it hard to formulate them directly as optimization objectives. A widely used technique is red teaming: the LLM is probed adversarially, manually or automatically, to elicit harmful outputs, and is then updated to prevent them.

5.2.2. Collecting Human Feedback

As seen above, the criteria for human values are subjective, so high-quality human feedback is required.

(1) Human Labeler Selection
Good feedback requires good labelers. Labelers are therefore usually required to be highly proficient in English and well educated. Sparrow, for example, used labelers who were fluent in British English and had a university-level education or higher.

Even so, mismatches between the intentions of LLM developers and those of human labelers frequently lead to low-quality human feedback and, in turn, unexpected LLM output. To reduce this, InstructGPT filters labelers by their agreement with researchers: researchers first label a small amount of data and then measure their agreement with each labeler.

(2) Human Feedback Collection
Human feedback is mainly collected in the following three ways.

Ranking-based approach

In early work, human labelers evaluated model-generated outputs in a coarse way, selecting only the best candidate without considering fine-grained alignment criteria. Different labelers may disagree about which candidate is best, however, and this approach ignores the unselected samples, so it can yield inaccurate or incomplete feedback. Later work therefore introduced the _Elo rating system_, which derives a preference ranking by comparing candidate outputs. The ranking serves as a training signal that steers the model toward preferred outputs, leading to more reliable and safer results.

Question-based approach

Human feedback can also be collected from labelers' answers to questions designed by LLM researchers.

Rule-based approach

Many studies use rule-based methods to obtain more detailed human feedback. In Sparrow, labelers not only select the response they consider best but also check a series of rules to test whether responses satisfy the alignment criteria. This yields two kinds of human feedback data: (1) response preference feedback, obtained by comparing the quality of outputs in pairs, and (2) rule violation feedback, obtained by collecting labelers' assessments into a score indicating how far a generated output violates the rules.

5.2.3. Reinforcement Learning from Human Feedback (RLHF)

Reinforcement learning (RL) is used to learn from this human feedback. The representative algorithm is Proximal Policy Optimization (PPO).

(1) RLHF System
RLHF mainly consists of three components: a pre-trained LM to be aligned, a reward model learning from human feedback, and an RL algorithm.

First, the pre-trained LM is usually a generative model: InstructGPT (175B) is built on GPT-3, and GopherCite (280B) is built on Gopher. Second, the reward model provides a signal that reflects human preferences over the text generated by the LM. It usually has far fewer parameters than the LM being aligned: InstructGPT uses a 6B GPT-3 model as its reward model, and GopherCite uses a 7B Gopher model. Finally, the RL algorithm is almost always Proximal Policy Optimization (PPO).

(2) Key Steps for RLHF

RLHF proceeds in three steps, shown in the figure above.

Supervised fine-tuning

The LM is first fine-tuned so that it shows the desired behavior. The training data pairs an input instruction (prompt) with the desired output. These input-output pairs are written by human labelers, and they usually cover a diverse set of tasks. In InstructGPT, for example, labelers composed prompts such as “List five ideas for how to regain enthusiasm for my career” and wrote the desired outputs for generative tasks such as open QA, brainstorming, chatting, and rewriting.

Reward model training

The second step trains the reward model on human feedback data. The LM first generates outputs for a set of prompts, and human labelers then rank the outputs for each input according to their preference. The reward model is trained to predict these rankings. A recent study also proposes RLAIF, which trains the reward model on AI feedback. Human feedback can lead to the evasion problem, in which the model becomes harmless at the cost of helpfulness, whereas AI feedback is reported to suffer less from it.
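Training the reward model to "predict the ranking" is commonly implemented as a pairwise loss over a preferred and a rejected response. The sketch below uses the form -log σ(r_chosen - r_rejected); treating this as the objective is an assumption here, since the post does not spell out the loss, and the function name is hypothetical.

```python
import math


def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    diff = reward_chosen - reward_rejected
    if diff < -30:
        # log(1 + exp(-diff)) is approximately -diff here; avoids overflow in exp().
        return -diff
    return math.log1p(math.exp(-diff))


# The loss is small when the preferred response gets the higher reward ...
print(pairwise_ranking_loss(2.0, 0.0))
# ... and large when the reward model ranks the pair the wrong way round.
print(pairwise_ranking_loss(0.0, 2.0))
```

A full ranking of K outputs can be turned into training signal by applying this loss to every pair drawn from the ranking.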

RL fine-tuning

Finally, the LM is fine-tuned with an RL algorithm (PPO) using the reward model. The pre-trained LM is the policy, the vocabulary is the action space, the token sequence generated so far is the state, and the reward is given by the reward model.

(3) Practical Strategies for RLHF
RLHF is promising for alignment tuning, but implementing it is not easy. This part introduces practical tricks for doing so.

Effective reward model training

InstructGPT used a small 6B reward model, but later work found that alignment tuning works better when the reward model is as large as the LLM or larger. In LLaMA 2, for example, pre-trained model checkpoints are used to initialize the reward model. The reward model and the LM then share the same pre-training knowledge, which is said to reduce information mismatch. A large reward model can overfit, however, so in addition to the ranking loss on input-output pairs, an LM loss is added as a regularizer. Finally, because a single reward model may struggle to satisfy all three (or more) alignment criteria, training multiple reward models, one per criterion, is also a good option.

Effective RL training

RL training is very unstable, so the LLM should be well fine-tuned with supervision before RL begins. One approach is to fine-tune the LLM, until convergence and before RL, on its best-ranked outputs for each prompt (best-of-N): for a given prompt, the LLM samples N outputs, and the reward model picks the best candidate.
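The best-of-N selection step is simple enough to sketch directly. In the snippet below, `generate` and `reward` are hypothetical callables standing in for an LLM sampler and a trained reward model, and the toy versions exist only so the example runs on its own.

```python
import random


def best_of_n(prompt, generate, reward, n=4):
    """Sample n candidate outputs and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda out: reward(prompt, out))


# Toy stand-ins (hypothetical): a real setup would call an LLM and a reward model.
_rng = random.Random(0)


def toy_generate(prompt):
    endings = ["ok", "a longer answer", "a much longer detailed answer"]
    return prompt + " " + _rng.choice(endings)


def toy_reward(prompt, output):
    return len(output)  # pretend that longer answers are preferred


print(best_of_n("Explain RLHF:", toy_generate, toy_reward, n=8))
```

The selected (prompt, best output) pairs can then serve as supervised fine-tuning data before RL starts.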

5.2.4. Alignment without RLHF

RLHF works well for alignment tuning but has limitations. It has to maintain several models besides the LM being aligned, namely a reward model and a reference model, at the same time. The PPO algorithm is also complex and very sensitive to hyper-parameters. Non-RL supervised fine-tuning has therefore been proposed as an alternative.

The basic idea of non-RL supervised learning is to train on a high-quality alignment dataset in a supervised manner. The underlying assumption is that the alignment dataset already contains the golden rules for avoiding unsafe behavior. Two things therefore need thought: how to construct the alignment dataset and how to design the fine-tuning loss.

For constructing the alignment dataset, one can refine human feedback data or collect the responses that a reward model rates highly. The fine-tuning loss is kept similar to the instruction tuning loss, with auxiliary losses added, such as ranking responses or contrasting instruction-response pairs.

5.2.5. Remarks on SFT and RLHF

To close, here is a brief look at the connections and differences between SFT and RLHF as ways of training LLMs.

(1) Overall Comparison with RL Formulation
As described above, RLHF first trains a reward model and then trains the LLM against it. SFT, by contrast, uses teacher forcing, so the LLM learns behavior cloning that imitates an expert. SFT is a “local” optimization with a token-level loss, whereas RLHF is a “global” optimization with a text-level loss.

(2) Pros and Cons of SFT
SFT, as used in pre-training and instruction tuning, plays the role of “unlocking” abilities that the LLM already has. It cannot “inject” new abilities into the LLM, so trying to stimulate non-endogenous abilities through SFT is very difficult.

Training with SFT alone also tends to cause hallucination, particularly when a large model is distilled into a smaller one.

Since SFT is behavior cloning, differences in writing style and quality among annotators can also affect training. For SFT, the quality of the training data therefore matters more than its quantity.

(3) Pros and Cons of RLHF
As mentioned, RLHF has played a major role in bringing human preferences and values into LLMs. The reasons are that RLHF (1) can largely mitigate the discrepancies among annotators seen in SFT, and (2) relies on preference annotation, which is much easier than writing annotation and therefore yields higher annotation quality. In addition, because it learns by contrasting self-generated responses, it can reduce the hallucination that arises from forcing the model to imitate external data.

RLHF is still an RL algorithm, however, so it inherits the usual RL problems of sample inefficiency and training instability. It therefore requires a complex iterative optimization process.

5.3. Parameter-Efficient Model Adaptation

This section introduces several parameter-efficient fine-tuning methods and the LLMs fine-tuned with them.

5.3.1. Parameter-Efficient Fine-Tuning Methods

Four parameter-efficient fine-tuning methods for Transformer language models are introduced.

(1) Adapter Tuning
Adapter tuning adds small bottleneck networks (adapters), which project to a lower dimension and back, into each Transformer layer, typically after the attention and feed-forward sublayers. The original LM parameters, including the self-attention layers, are frozen, and only the adapters are trained.

(2) Prefix Tuning
Prefix tuning learns trainable continuous vectors called “prefixes”, using a reparameterization trick based on an MLP (multi-layer perceptron). The prefixes act as task-specific virtual token embeddings. After training, the MLP is discarded, and only the learned prefix vectors are kept and prepended when the model is used for that task.

(3) Prompt Tuning
Prompt tuning freezes the LM and trains only the prompt vectors added at the input layer. P-tuning is a representative example.

(4) Low-Rank Adaptation (LoRA)
LoRA applies a low-rank approximation to the weight update. In the update W ← W + ∆W, W is frozen, ∆W is factorized into low-rank matrices (∆W = A·B^T), and only A and B are trained. This greatly reduces memory and storage usage.
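The saving from LoRA is easy to quantify with a toy implementation. The sketch below follows the notation above (∆W = A·B^T with A of shape d×r and B of shape k×r) using plain Python lists; the helper names and the example sizes d = k = 4096, r = 8 are assumptions for illustration, and a real implementation would use a tensor library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def lora_forward(x, W, A, B):
    """y = x (W + A B^T), computed as x W + (x A) B^T so dW is never materialized."""
    base = matmul(x, W)
    update = matmul(matmul(x, A), transpose(B))
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(base, update)]


def trainable_params(d, k, r):
    """Full fine-tuning trains d*k weights; LoRA trains r*(d+k)."""
    return d * k, r * (d + k)


x = [[1.0, 2.0]]              # one input row, d = 2
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight, d x k
A = [[1.0], [1.0]]            # trainable, d x r with r = 1
B = [[0.0], [2.0]]            # trainable, k x r
print(lora_forward(x, W, A, B))

full, lora = trainable_params(4096, 4096, 8)
print(full, lora, full // lora)
```

For a 4096×4096 matrix with rank 8, the trainable parameter count drops by a factor of 256.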

5.3.2. Parameter-Efficient Fine-Tuning on LLMs

Among the many parameter-efficient fine-tuning methods, LoRA and adapters are the ones most often applied to open-source LLMs. Alpaca-LoRA is a LoRA-trained version of Alpaca (LLaMA 7B fine-tuned on the 52K Alpaca instruction set), and LLaMA-Adapter has also been proposed for LLaMA. A recent study applied adapter tuning and LoRA to GPT-J, BLOOM, and LLaMA 7B and compared them with GPT-3.5: performance was lower on difficult tasks but comparable on simple ones. This suggests that LoRA can perform well with far fewer trainable parameters. Most existing PEFT methods, however, have been tested only on small pre-trained LMs of around 7B parameters, so their effect on larger language models needs more study.

5.4. Memory-Efficient Model Adaptation

Because LLMs are so large, their inference memory footprint causes many problems at deployment. Much work therefore aims to reduce the memory footprint of large LLMs and, with it, inference latency.

5.4.1. Background for Quantization

This part gives the background for quantization, one way of reducing memory. In neural network compression, INT8 quantization converts floating-point numbers into 8-bit integers. Formally, $x_q = R(x/S)-Z$, where $S$ is the scaling factor (which determines the clipping range), $Z$ is the zero-point factor (which determines symmetric or asymmetric quantization), and $R$ is the rounding operation. Dequantization is then $\tilde{x} = S(x_q + Z)$.
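The two formulas above translate almost line by line into code. The sketch below implements asymmetric INT8 quantization with the same sign convention ($x_q = R(x/S) - Z$, $\tilde{x} = S(x_q + Z)$); the min/max calibration in `quant_params` and the function names are assumptions for illustration, not the scheme of any particular library.

```python
def quant_params(values, num_bits=8):
    """Asymmetric scale S and zero-point Z so that min(values) maps to the lowest integer."""
    qmin = -(2 ** (num_bits - 1))
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2 ** num_bits - 1) or 1.0  # fall back to 1.0 for constant input
    zero_point = round(lo / scale) - qmin
    return scale, zero_point


def quantize(x, scale, zero_point, num_bits=8):
    """x_q = R(x / S) - Z, clipped to the signed integer range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return max(qmin, min(qmax, round(x / scale) - zero_point))


def dequantize(x_q, scale, zero_point):
    """x_tilde = S * (x_q + Z)."""
    return scale * (x_q + zero_point)


weights = [-1.0, 0.0, 0.5, 2.0]
S, Z = quant_params(weights)
for w in weights:
    q = quantize(w, S, Z)
    print(w, q, round(dequantize(q, S, Z), 4))
```

The reconstruction error per value is bounded by roughly half the scale, which is why a single large outlier (a wide clipping range) hurts every other value in the tensor.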

5.4.2. Quantization Methods for LLMs

Quantization approaches fall into two groups: quantization-aware training (QAT) and post-training quantization (PTQ). The former requires full model retraining, and the latter does not. Because LLMs have a huge number of parameters, PTQ is preferred over QAT.

Post-Training Quantization (PTQ)
Several PTQ methods are introduced below.

Mixed-precision decomposition : As observed in the LLM.int8 paper, extremely large values appear in hidden activations once the model size reaches 6.7B or more. This is called the emergence of outliers. Since the outliers are concentrated in a few feature dimensions, LLM.int8 separates those dimensions from the rest and computes the two parts in 16-bit floating point and 8-bit integers, respectively.
Fine-grained quantization : Coarse-grained quantization, which quantizes a whole tensor with a single set of parameters, gives poor reconstruction. ZeroQuant therefore quantizes at a finer granularity: token-wise for activations and group-wise for weights.
Balancing the quantization difficulty : Since weights are easier to quantize than activations, SmoothQuant incorporates a scaling transformation that migrates part of the quantization difficulty from activations to weights, balancing the two.

5.4.3. Empirical Analysis and Findings

It is important to know which precision level (e.g. INT8 or INT4) to apply and when.

Important Findings from Existing Work.
Key technical details found in recent work such as LLM.int8, GPTQ, QLoRA, and GLM are summarized below.

INT8 weight quantization can often yield very good results on LLMs, while the performance of lower precision weight quantization depends on specific methods : LLMs are fairly robust to weight quantization, so for a given memory budget a larger quantized model tends to work better than a smaller model at higher precision (a 4-bit 60GB LLM outperforms an 8-bit 30GB LLM). Emergent capabilities such as in-context learning, CoT, and instruction following are also found to be preserved under 4-bit weight quantization.
Activations are more difficult to be quantized than weights : As noted above, LLMs of 6.7B or more contain outliers, which makes reconstruction hard. Mixed-precision decomposition, fine-grained quantization, and difficulty migration were devised to overcome this. Smaller models, where such outliers do not emerge, are thus more robust to activation quantization.
Efficient fine-tuning enhanced quantization is a good option to enhance the performance of quantized LLMs : Combining quantization with PEFT, as in QLoRA, is a good way to obtain strong performance.

Empirical Analysis on Quantization Experiments

Both 8-bit and 4-bit weight quantization perform close to the 16-bit model.
In practice, it is therefore recommended to try 4-bit weight quantization first for memory reduction.

5.4.4. Open-source Libraries and Quantized LLMs

This part introduces quantization libraries.

Quantization Libraries

Bitsandbytes
GPTQ-for-LLaMA
AutoGPTQ : can be used together with the HuggingFace PEFT library
llama.cpp

Quantized LLMs
Quantized versions of LLMs such as BLOOM, GPT-J, and ChatGLM are available through HuggingFace. Among them, GPTQ is the most widely used method for producing quantized LLMs, including quantized variants of LLaMA and OPT.

Continued in A Survey of Large Language Models (3)…

A Survey of Large Language Models (1)

Sun, 24 Dec 2023 01:45:00 +0000

[pdf] [github]

This post summarizes a survey paper on Large Language Models (LLMs); links to the cited papers are omitted.

Abstract

An LLM (Large Language Model) is a language model with tens or hundreds of billions of parameters. It differs from a PLM (Pre-trained Language Model) in that it shows some special abilities, such as in-context learning.
This survey reviews recent LLM research from four aspects: pre-training, adaptation tuning, utilization, and capacity evaluation.

1. Introduction

Enabling machines to read, write, and communicate like humans has been a long-standing goal.

Language modeling (LM) is a major approach to giving machines language intelligence: it models the generative likelihood of word sequences so as to predict the probability of the next word. LM research can be divided into four stages over time.

Statistical Language Models (SLM) : A statistical approach that predicts words with n-grams under the Markov assumption. It suffers from the curse of dimensionality.
Neural Language Models (NLM) : Predict word probabilities with an MLP (multi-layer perceptron) or an RNN (Recurrent Neural Network). This line of work had a major impact on NLP research.
Pre-trained Language Models (PLM) : The “pre-training” and “fine-tuning” paradigm. ELMo, BERT, GPT-2, BART, etc.
Large Language Models (LLM) : Building on the scaling law papers, studies found that the performance of PLMs improves as their scale grows. The 175B GPT-3 and the 540B PaLM are examples; beyond simply performing better, they show special abilities for solving complex tasks (such as in-context learning).

As the figure above shows, LLM research has been especially active since the release of ChatGPT. Unlike earlier work on modeling and generating text data, LLM research focuses on complex task solving (from language modeling to task solving).

(1) Differences between LLM and PLM
LLMs differ from PLMs in three main ways.

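LLMs show unprecedentedly powerful performance compared with PLMs (especially on complex tasks).
Prompting interfaces such as the GPT-4 API have revolutionized the way humans use AI systems.
Because of their sheer scale, LLMs have blurred the boundary between research and engineering.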

(2) Limitations of LLMs
The underlying principles of LLMs are still not well explored. More research is needed on when and how LLMs come to outperform PLMs so clearly. In addition, because of their size, LLMs can be trained by only a few industry players, and important training details such as data collection and cleaning are not made public. Finally, LLMs can generate toxic, fictitious, or harmful content.

(3) The need for LLM research
Deeper research on LLMs is needed to overcome these problems. The survey organizes existing work from four perspectives.

pre-training (how to pretrain a capable LLM)
adaptation (how to effectively adapt pre-trained LLMs for better use)
utilization (how to use LLMs for solving various downstream tasks)
capability evaluation (how to evaluate the abilities of LLMs and existing empirical findings)

It then additionally covers useful prompt design, LLM applications in specific domains, and related topics.

2. Overview

2.1. Background for LLMs

(1) Scaling Laws for LLMs
LLMs are basically built on the Transformer, but they are far larger in model size, data size, and total compute. Many studies have found that scaling increases model capacity. Two scaling laws are introduced here.

KM scaling law

In 2020, Kaplan et al. (the OpenAI team) presented three scaling laws with respect to model size (N), dataset size (D), and the amount of training compute (C).
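For reference, the three power laws are usually written as below, where $L(\cdot)$ is the cross-entropy loss in nats. The constants are the values commonly quoted from Kaplan et al. and are reproduced here from memory, so they should be checked against the paper before being relied on.

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \sim 0.076, \quad N_c \sim 8.8 \times 10^{13}$$

$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \sim 0.095, \quad D_c \sim 5.4 \times 10^{13}$$

$$L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \sim 0.050, \quad C_c \sim 3.1 \times 10^{8}$$

Each law holds when the other two factors are not the bottleneck.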

They showed that model performance depends strongly on these three factors. In a follow-up study, the OpenAI team decomposed the LM loss into two parts: an irreducible loss (the entropy of the true data distribution) and a reducible loss (an estimate of the KL divergence between the true and model distributions).

Chinchilla scaling law

Hoffmann et al. (the Google DeepMind team) proposed an alternative form of scaling law, fitted empirically over a range of model sizes and data sizes.
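For reference, the Chinchilla law is usually written in the following form. The fitted constants are the commonly quoted ones and are reproduced here from memory, so they should be checked against the paper before being relied on.

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \quad E = 1.69, \; A = 406.4, \; B = 410.7, \; \alpha = 0.34, \; \beta = 0.28$$

Minimizing this loss under the compute constraint $C \approx 6ND$ gives the compute-optimal allocation

$$N_{opt}(C) = G\left(\frac{C}{6}\right)^{a}, \quad D_{opt}(C) = G^{-1}\left(\frac{C}{6}\right)^{b}, \quad a = \frac{\beta}{\alpha + \beta}, \; b = \frac{\alpha}{\alpha + \beta}$$

where $G$ is a scaling coefficient computed from $A$, $B$, $\alpha$, and $\beta$. With the constants above, $a \approx 0.45$ and $b \approx 0.55$, so model size and data size grow at similar rates with compute.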

As the compute budget grows, the KM scaling law favors a larger allocation to model size, whereas the Chinchilla scaling law argues that model size and dataset size should be increased in equal scale.

(2) Discussion on Scaling Laws
Scaling laws can be discussed from two angles.

Predictable Scaling.

Based on scaling laws, it is feasible to estimate the performance of a larger model from smaller ones. For very large models, even measuring performance can be prohibitively costly, so being able to transfer what is learned from small models is a big advantage. Scaling laws can also be used to monitor the training status of an LLM, for example to spot abnormal behavior such as loss spikes early. Finally, as models grow, the public data available for training LLMs may be “exhausted”, which points to the need for data augmentation techniques to deal with data scarcity.

Task-level Predictability.

LLM scaling laws focus on the LM loss. In practice, however, a decrease in LM loss does not always mean better task performance. For GPT-4, some capabilities, including coding ability, are reported to be accurately predictable via scaling laws. In some cases, though, there is a phenomenon called inverse scaling, in which task performance gets worse even as the LM loss decreases. There are also abilities, such as in-context learning, that scaling laws cannot predict.

(3) Emergent Abilities of LLMs
An “emergent ability” is an ability that is absent in smaller models but appears abruptly in large models. Such abilities show up especially on complex tasks.

In-context learning.

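In-context learning (ICL) was formally introduced with GPT-3. It is the ability to complete the input text according to a given instruction (and optionally a few demonstrations) without additional training or gradient updates. ICL ability varies widely by task: on arithmetic tasks even the 13B GPT-3 does well, whereas on the Persian QA task even the 175B model does not.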

Instruction Following (Instruction Tuning).

Through multi-task fine-tuning on tasks formatted as natural language descriptions, commonly called instruction tuning, an LLM can solve unseen tasks given only an instruction, without explicit examples. For example, LaMDA-PT handles unseen tasks well from 68B, and PaLM performs well on evaluation benchmarks such as MMLU, BBH, TyDiQA, and MGSM from 62B.

Step-by-step reasoning (Chain-of-Thought ; CoT).

The reasoning ability elicited by CoT prompting becomes clearly effective only at around 100B parameters and above.

(4) How Emergent Abilities Relate to Scaling Laws
Scaling laws and emergent abilities appear to be contradictory findings: one describes continuous improvement, the other a sharp performance leap. More research is needed, but emergent abilities are said to resemble how humans learn language. A child speaks only a few words and then, discontinuously and “all of a sudden”, starts forming sentences, which is seen as analogous to an LLM acquiring an emergent ability.

(5) Key Techniques for LLMs
The key factors that make LLMs general and capable learners are as follows.

Scaling.

As mentioned above, an LLM is a scaled-up Transformer. GPT-3 and PaLM explored the scaling limits at 175B and 540B respectively, and under a fixed compute budget there is a limit to how far one can scale. In that situation, scaling laws should be used to allocate compute more efficiently. Under the same compute budget, Chinchilla outperforms Gopher by using more training tokens instead of a larger model. Additionally, data scaling requires a careful cleaning process, because the quality of pre-training data has a very large effect on the model.

Training.

Because LLMs are so large, they require distributed training algorithms. Several optimization frameworks have been released to support this kind of parallel training, such as DeepSpeed and Megatron-LM. Techniques such as restarting to recover from training loss spikes and mixed precision training should also be considered. GPT-4 developed its own infrastructure and optimization methods, which make it possible to predict the performance of a large model from much smaller ones.

Ability eliciting.

Once an LLM has been trained, it is important to elicit its abilities through technical approaches such as instruction tuning or CoT prompting.

Alignment tuning.

LLMs can generate toxic, biased, or harmful content. As proposed in InstructGPT, LLMs should be aligned with three human values: helpful, honest, and harmless (3H). InstructGPT addresses this with RLHF (Reinforcement Learning from Human Feedback).

Tools manipulation.

Because LLMs store information in their parameters, their abilities are bounded by the pre-training data and they inevitably produce out-of-date information. To address this, there are attempts to compensate for these weaknesses with external tools. In other words, this gives the LLM “eyes and ears”.

2.2. Technical Evolution for GPT-series Models

With the rise of ChatGPT, the GPT series has come to lead LLM research. GPT-series models are decoder-only models, and the key points are (1) training the model to predict the next word accurately and (2) scaling up the LM. The figure below shows the evolution of the GPT series.

(1) Early Explorations.
OpenAI built GPT-1 and GPT-2 on the Transformer, which Google introduced in 2017. GPT-1 (2018) is the start of the GPT (Generative Pre-Training) series and the foundation of the decoder-only line of models. GPT-2 scaled GPT-1 up to 1.5B parameters and was trained on WebText, a large web dataset. GPT-2, from the same generation as BERT, leaned further toward unsupervised LM training and sought to handle downstream tasks through language modeling so that it transfers easily. Based on the following passage from the paper, the GPT series came to focus on unsupervised LM via next-word prediction.

“Since the (task-specific) supervised objective is the same
as the unsupervised (language modeling) objective but only
evaluated on a subset of the sequence, the global minimum
of the unsupervised objective is also the global minimum
of the supervised objective (for various tasks)”

(2) Capacity Leap.
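Although GPT-2 was intended to be an “unsupervised multitask learner”, its overall performance fell short of supervised fine-tuned SOTA methods, and in practice it was widely fine-tuned on downstream tasks (especially dialog). GPT-3 scaled the model to 175B parameters (2020). This paper is where the concept of in-context learning (ICL) first appears. Although the paper does not say so explicitly, GPT-3 shows emergent abilities that transcend the scaling law. This is the moment LLMs emerge from PLMs.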

(3) Capacity Enhancement.
GPT-3 has become the base of LLMs. OpenAI further improved the GPT-3 model in two directions.

Training on code data.

GPT-3's biggest weakness was reasoning ability, and it was particularly weak at code generation and solving math problems. To address this, in July 2021 OpenAI introduced Codex, a model trained on a large amount of GitHub code. It showed strong ability on code generation and math problems, and a contrastive learning approach reported in January 2022 brought further gains. This code-based GPT model (code-davinci-002) later became the base of the GPT-3.5 models. This finding suggests that training on code data greatly improves reasoning ability.

Human alignment.

As early as 2017, OpenAI had already described on its blog how to learn human preferences with RL. In the same year the RL algorithm PPO (Proximal Policy Optimization) was introduced, and models that learn human preferences began to appear in earnest. In January 2020, GPT-2 was fine-tuned with PPO to learn human preferences, which improved its performance. In January 2022, building on this line of work, OpenAI applied RLHF to GPT-3 and introduced InstructGPT.

The GPT-3 models improved with these two techniques came to be called GPT-3.5.

(4) The Milestones of Language Models.
Building on the exploration efforts above, OpenAI reached two major milestones: ChatGPT and GPT-4.

ChatGPT

On November 30, 2022, ChatGPT, a conversation model based on GPT-3.5, was released. As the announcement blog post puts it (“a sibling model to InstructGPT”), it was trained in a way similar to InstructGPT but specially optimized for dialogue. ChatGPT combines the ability to communicate with humans, to solve mathematical problems, to track context accurately in multi-turn dialogue, and to align well with human values (3H). So far, it seems to be the ever most powerful chatbot in the AI history. ChatGPT's arrival marked the point at which AI research moved into the public spotlight.

GPT-4

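In March 2023, GPT-4 was released, accepting multimodal signals as input. It earned the “4” in its name because its capacity is far beyond that of previous generations. Thanks to six months of iterative alignment with RLHF, its alignment with human values is also strong (red teaming and so on). In addition, OpenAI introduced a mechanism called predictable scaling for the first time, which predicts final performance using a small proportion of compute during training.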

GPT-4V, GPT-4 turbo, and beyond.

In September 2023, OpenAI released GPT-4V, built on GPT-4 with a focus on the safe use of its vision capabilities (mitigating risks from visual inputs). With its strong vision capabilities, GPT-4V has great potential as a powerful multimodal learning system. In November 2023, OpenAI released GPT-4 Turbo, which outperforms GPT-4, extends the knowledge cutoff to April 2023, supports a longer context window (128K tokens), and offers improved developer convenience through the API.

Despite this progress, LLMs remain vulnerable to hallucination. (This is covered in more detail in Section 7.)

3. Resources of LLMs

Developing or reproducing an LLM is not easy, so it is important to make good use of existing LLMs. This section summarizes publicly available LLM resources.

3.1. Publicly Available Model Checkpoints or APIs

Depending on budget, we look at models in two groups: tens of billions and hundreds of billions of parameters.

(1) Models with Tens of Billions of Parameters.
Apart from LLaMA and LLaMA 2 (up to 70B, Meta AI), NLLB (54.5B), and Falcon (40B), most models in this group fall in the 10B to 20B range. Examples include T5-large, PanGu-$\alpha$, T0, CodeGen, Flan-T5, and GPT-NeoX-20B. Among these, Flan-T5 serves as a premier model for instruction tuning; its training explored (1) increasing the number of tasks, (2) scaling the model size, and (3) fine-tuning with chain-of-thought prompting data.

For code generation, CodeGen (11B) performs well. The paper also introduces a benchmark called MTPB, which requires LLM-scale models to solve. CodeGen2 and StarCoder also belong to the tens-of-billions group.

In the multilingual setting, mT0 (13B) performs well, and PanGu-$\alpha$ performs well in Chinese.

LLaMA (65B), with roughly five times the parameters of the other models, shows the strongest performance in this group. It is particularly strong at instruction following (instruction tuning), which is to some extent in line with the emergent abilities described above. LLaMA 2 improved on LLaMA by applying RLHF and was additionally developed into a chat-oriented version, LLaMA 2-Chat. Because LLaMA is smaller (compared with hundreds-of-billions LLMs) and publicly available, it is used very heavily in research. More recently, Falcon has shown good performance by training on RefinedWeb, a carefully cleaned pre-training dataset.

Typically, models in this group still need hundreds to thousands of GPUs or TPUs for pre-training. GPT-NeoX-20B used 12 servers with 8 A100-40G GPUs each (96 A100-40G in total), and LLaMA used 2,048 A100-80G GPUs.

(2) Models with Hundreds of Billions of Parameters.
Few models in this group have been publicly released. OPT, OPT-IML, BLOOM, and BLOOMZ have around 175B parameters, similar to GPT-3, while GLM and Galactica are open-source models with 130B and 120B parameters respectively. Pre-training at this scale requires an enormous number of GPUs: OPT-175B was trained on 992 A100-80G GPUs and GLM-130B on 768 A100-40G GPUs (96 DGX-A100 nodes).

(3) LLaMA Model Family.

In February 2023, Meta AI first released the LLaMA family (7B, 13B, 30B, 65B). As the strongest of the open resources, LLaMA became one of the most widely used models in research. Many researchers have used LLaMA as the base model for instruction tuning and continual pre-training.

Among these, Stanford's Alpaca is the first open instruction-following model fine-tuned from LLaMA (7B). Its instruction data, Alpaca-52K, was generated by applying self-instruct with text-davinci-003, and the training code was later reused by models such as Alpaca-LoRA, Koala, and BELLE. Vicuna is another well-known LLaMA variant and, in multimodal language models in particular, led to the emergence of LLaVA, MiniGPT-4, InstructBLIP, and PandaGPT.

(4) Public API of LLMs.
Instead of loading a model on a local server for inference, research that uses APIs is also active in both academia and industry. GPT-3's ada, babbage, curie, and davinci are examples.

3.2. Commonly Used Corpora for Pre-training.

LLMs vary, but their pre-training corpora are largely similar. They fall into six broad groups: Books, CommonCrawl (CC), Reddit links, Wikipedia, Code, and others. The table above lists commonly used pre-training datasets.

Below are the pre-training data mixtures of three representative LLMs.

GPT-3 175B : mixture of 300B tokens; CommonCrawl, WebText2, Books1, Books2, Wikipedia
PaLM 540B : mixture of 780B tokens; social media conversations, filtered webpages, books, Github, multilingual Wikipedia, and news.
LLaMA : CommonCrawl, C4, Github, Wikipedia, books, ArXiv, and StackExchange. The training data size for LLaMA (6B) and LLaMA (13B) is 1.0T tokens, while 1.4T tokens are used for LLaMA (32B) and LLaMA (65B).

3.3. Commonly Used Datasets for Fine-tuning

(1) Instruction Tuning Datasets

※ See the paper for a detailed description of each dataset.

(2) Alignment Datasets

※ See the paper for a detailed description of each dataset.

3.4. Library Resource

A brief introduction to libraries for LLM development.

Transformers : a Python library for Transformer models, maintained by Hugging Face.
DeepSpeed : a deep learning optimization library for PyTorch, maintained by Microsoft.
Megatron-LM : a deep learning library for large-scale LM training, maintained by NVIDIA. It includes data parallelism, mixed-precision training, FlashAttention, and more.
JAX : a library for high-performance machine learning algorithms, maintained by Google. TPU support is a key advantage.
Colossal-AI : a tool for large-scale AI training, maintained by HPC-AI.
BMTrain : a tool from OpenBMB for efficiently training models with large-scale parameters. Flan-T5, GLM, and others are available through it.
FastMoE : a tool for training MoE (Mixture-of-Experts) models.
vLLM : a fast, memory-efficient LLM inference tool built on high serving throughput, effective attention memory management using PagedAttention, continuous batching, and optimized CUDA kernels.

Continued in A Survey of Large Language Models (2)…

[ICML2023] Exploring the Benefits of Training Expert Language Models over Instruction Tuning

Sun, 17 Dec 2023 12:43:00 +0000

[pdf] [github]

Joel Jang ^1*, Seungone Kim ^1*, Seonghyeon Ye ^1*, Doyoung Kim ^1*, Lajanugen Logeswaran ², Moontae Lee ^2,3, Kyungjae Lee ², Minjoon Seo ^1*
¹ KAIST ² LG AI Research ³ University of Illinois Chicago. Correspondence to: Joel Jang joeljang@kaist.ac.kr.

Abstract

(Multitask prompting) Multitask prompted fine-tuning (MT), which instruction-tunes an LM on many tasks, has shown strong ability on unseen tasks. Much prior work has reported that scaling the number of training tasks improves performance.
(Motivation) Surprisingly, the authors find that an expert LM fine-tuned on just a single task outperforms an MT LM trained on more than 300 tasks, by 3.20% on 11 unseen datasets and by 1.29% on 13 BIG-bench datasets.
This calls into question the earlier finding that the number of tasks must be scaled to make MT LMs stronger.
They further show that training separate expert LMs per task, instead of a single MT LM, can help zero-shot inference: it (1) avoids the negative task transfer that often occurs during instruction tuning, (2) enables continual learning without re-training or catastrophic forgetting, and (3) shows compositional capabilities when individual experts are merged.

Introduction

Research on MT LMs, which instruction-tune a pretrained language model (PLM) on many tasks, has recently been very active, and these models are known to perform very well. This work, however, questions the current MT LM paradigm in two parts.

Part1

Much prior work reported that an MT LM's generalization to unseen tasks scales with the number of tasks seen during training. In this work, however, the authors happened to find that an expert LM trained on a single task beats T0-3B, which was trained on more than 300 tasks, by a non-trivial margin.

The authors therefore trained a separate expert LM for each of the 296 tasks used to train T0-3B. Seven of these 296 expert LMs outperform T0-3B on unseen tasks (Figure 1). Measured with these experts, the gain is 3.2% on 11 unseen tasks and 1.29% on BIG-bench. The authors also show that a simple mechanism for retrieving the relevant expert for each unseen task can substantially outperform T0-3B. With an improvement of nearly 12%, they show that choosing the right expert is more effective and more efficient than naively training a single MT LM.

Part2

Beyond the finding above, the authors identify three further advantages of RoE (Retrieval of Experts) over MT LMs.

MT LMs sometimes show sub-optimal performance on “seen” tasks due to negative task transfer: learning many tasks at once interferes with learning some of them. Expert LMs learn each task independently and are free of this problem. In experiments they outperform T0-3B by 10.4% on the 36 training tasks.
MT LMs suffer from catastrophic forgetting. The RoE approach does not have this problem at all (absolutely no degradation).
MT LMs perform poorly when two tasks must be composed, whereas RoE does not. Two mT5-3B models are trained as a summarization expert and a translation expert, respectively, and then merged; the merged model outperforms mT0-3B on summarization + translation.

Expert Language Models

Training Experts

Training uses adapters for parameter-efficient fine-tuning: the underlying LM is frozen and only the adapter is trained. As shown in Figure 3 above, experts are divided into Prompt Experts (PE), each of which learns the task corresponding to one prompt, and Dataset Experts (DE), each of which learns one dataset through multiple training prompts. Only the adapter is trained for a PE, while the whole LM is trained for a DE.

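To describe the adapter:

Each Transformer layer normally consists of a sequence of steps in which the hidden states of Eq. (1) above go through self-attention as in Eq. (2). The adapter

replaces Eq. (1) with Eq. (3), adding an FFN that projects to hidden dimension e; everything else is frozen and only this FFN is trained.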

Retrieval-of-Experts (RoE)

After each expert is trained, an expert library is built and dense retrieval is applied over it.

The authors first build the expert library. For each expert LM, S training instances are randomly sampled from the data it was trained on, so the library has size [S × # of experts]. A Sentence Transformer is used to obtain an embedding for each instance.

At retrieval time, $Q$ queries are sampled from the target task at inference. For each of the $Q$ queries, MIPS (maximum inner product search) retrieves one entry from the expert library, giving $Q$ retrieved experts, and the expert retrieved most often is selected.
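The retrieval step above can be sketched as a brute-force inner-product search followed by a majority vote. This is a minimal sketch with toy two-dimensional embeddings; the function names and the library layout (a list of `(expert_id, embedding)` pairs) are assumptions for illustration, not the authors' implementation, and a real system would use Sentence Transformer embeddings and an approximate MIPS index.

```python
from collections import Counter


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_expert(library, query_embeddings):
    """Pick the expert whose library entries win the most queries.

    library: list of (expert_id, embedding) pairs, S entries per expert.
    query_embeddings: list of Q query embeddings from the target task.
    """
    votes = Counter()
    for query in query_embeddings:
        # Brute-force MIPS: the library entry with the largest inner product.
        best_id, _ = max(library, key=lambda item: dot(item[1], query))
        votes[best_id] += 1
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    library = [
        ("cosmos_qa", [1.0, 0.0]),
        ("cosmos_qa", [0.9, 0.1]),
        ("samsum", [0.0, 1.0]),
    ]
    queries = [[0.8, 0.2], [0.7, 0.1], [0.1, 0.9]]
    print(retrieve_expert(library, queries))
```

Voting over $Q$ queries rather than trusting a single query makes the choice less sensitive to one unrepresentative example.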

Finally, following Cold Fusion, which showed that merging individually fine-tuned LMs can approach multitask fine-tuning, the retrieved expert LMs are merged into a new expert LM. The merge is

$$\theta_{new} = \theta_{vanilla} + \sum_{i=1}^{N} \lambda_i \tau_i$$

where $\tau_i$ is the parameter difference between the $i$-th expert LM and the vanilla pre-trained LM, $(\theta_{expert} - \theta_{vanilla})$. Unless stated otherwise, $\lambda_i = 1/N$, so the expert LMs are merged uniformly.
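The merge can be sketched as adding weighted parameter differences back onto the vanilla model, with $\lambda = 1/N$ by default. In this minimal sketch each parameter is a single float keyed by name to keep the example short; the function name and the dictionary layout are assumptions, and real checkpoints would hold tensors instead.

```python
def merge_experts(vanilla, experts, weights=None):
    """Merge experts as theta_vanilla + sum_i lambda_i * (theta_expert_i - theta_vanilla).

    vanilla: dict mapping parameter name -> value of the pre-trained LM.
    experts: list of dicts with the same keys, one per expert LM.
    weights: optional list of lambda_i; defaults to uniform 1/N.
    """
    n = len(experts)
    if weights is None:
        weights = [1.0 / n] * n
    merged = {}
    for name, base in vanilla.items():
        # tau_i = expert value minus vanilla value for this parameter.
        delta = sum(w * (expert[name] - base) for w, expert in zip(weights, experts))
        merged[name] = base + delta
    return merged


if __name__ == "__main__":
    vanilla = {"w": 1.0}
    summarizer = {"w": 2.0}
    translator = {"w": 4.0}
    print(merge_experts(vanilla, [summarizer, translator]))
```

With uniform weights this reduces to averaging the expert parameters, since the vanilla terms cancel.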

Experimental Setup

Training setup

The 36 training datasets of T0 are used. The prompts are also taken from T0, giving 296 prompts (tasks), so 296 PEs and 36 DEs are created. The LM-adapted T5 checkpoint is used as the base model. Training runs for 5 epochs with a learning rate of 1e-4, and S = 100 for the expert library.

Evaluation setup

The MT LMs used for comparison are T0-3B and T0-11B, and RoE is T5-3B + DE/PE. As in the original T0 paper, the 11 unseen datasets fall into 4 categories, and 13 datasets from BIG-bench are used. In addition, 8 new generative tasks that T0 was not trained on are used. At inference, $Q$ for RoE is fixed at 32.

Expert LMs Can Generalize to Unseen Tasks

This section tests experimentally whether expert LMs can become a new paradigm. Table 1 below shows results on the 11 unseen datasets, Table 2 on the 13 BIG-bench datasets, and Table 3 on the 8 unseen generative tasks.

In Table 1, T5(3B) + Cos PE (the no_prompt_text prompt on the Cosmos-QA dataset) beats T0-3B on 8 of the 11 datasets. This result can overturn earlier findings on scaling MT LMs. In Table 2, Cos PE again shows the highest mean accuracy. In Table 3, T5 + Sam PE (the “given the above dialog write a summary” prompt on the SAMSum dataset) beats T0-3B by 6.83 points on average over the 8 tasks.

Also in Table 1, when the best-performing expert LM is picked by an oracle in the RoE step, it improves over T0-3B, the larger T0-11B, and even GPT-3 by 11.94%, 2.61%, and 4.37% respectively. In Table 3, the oracle improves by as much as 13.69 points.

Finally, T5 + PE w/ RoE, which uses the actual RoE retrieval instead of the oracle, beats T0-3B on 8 of the 11 unseen tasks. The remaining gap to the oracle shows there is still plenty of room for improvement on the retriever side.

Merging of Experts

In Table 4, (Mer.) marks merged expert LMs. The first three rows are results for PE LMs and their merged LM; the last three rows are results for DE LMs and their merged LM. With RoE, merging led to positive task transfer in a few cases such as COPA, but to negative task transfer in most cases.

To analyze this, the last three rows merge DEs, which are trained with full LM training. The merged model gives the best or second-best result in most cases, so it can be argued that DE merging shows composition ability without negative task transfer.

Analysis of Experts

Returning to Figure 1, the authors offer an analysis from three angles.

First, of the 8 training task categories, only Multiple-Choice QA (MCQA) tasks show good generalization. The authors hypothesize that this is because the 11 unseen tasks, which are in a classification setting, require instructions in a QA-like form.

Second, among the 36 training datasets, only three (COSMOS-QA, SOCIAL-I-QA, and DREAM) perform consistently well whether as a PE or a DE. All three are commonsense reasoning datasets, which suggests that commonsense reasoning is essential for generalizing to unseen tasks.

Finally, T5 + SAM PE performs best in Table 3. SAM PE is the expert LM for the SAMSUM dialog summarization dataset. In the classification settings of Tables 1 and 2, however, this model is nearly 10% worse than T0-3B, which shows that there is no free lunch.

Benefits of Expert LMs over MT LMs

Seen Task Performance

First, to show that expert LMs are less affected by negative task transfer, T5(3B) + PE w/ RoE is compared with the two MT LMs, T0-3B and T0-11B, on the 36 validation datasets. As shown in the table above, it achieves +10.40% and +7.70% higher mean accuracy than T0-3B and T0-11B, respectively.

This is because the evaluation uses seen instructions, so the simple retrieval mechanism is likely to select the best-performing expert from the expert library. This is supported by the fact that its performance is close to that of T5(3B) + PE w/ RoE (Orc.). In fact, for 280 of the 296 seen tasks T5(3B) + PE w/ RoE retrieves a PE trained on the same training dataset, and for 185 of the 296 it retrieves the PE for the same prompt and dataset (i.e., the oracle).

Continual Learning of New Tasks

When a language model needs to be fine-tuned on additional datasets after deployment, it is important to make the fine-tuned LM a continual learner (Chakrabarty et al., 2022), because re-training on all original and additional tasks at every update is computationally expensive. Prior work addresses this with a rehearsal-based method, continually training the fine-tuned LM on the original and additional datasets (Chakrabarty et al., 2022). However, this approach (1) assumes access to the original data and (2) still incurs additional compute for continually training on the additional samples during instruction tuning.

This work shows that the same outcome can be achieved without access to the original and additional datasets together, through distributed training: a separate expert LM is trained for each additional task and simply added to the expert library. Specifically, Table 6 compares continually training the MT LM (T0-3B) into CT0-3B with the proposed distributed approach.

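The table shows that the proposed approach has no degradation at all on seen tasks and only a slight (-0.15%) degradation on unseen tasks. It also has an average advantage of +1.08 over the MT LM counterpart on the target tasks. Without access to the original data or heavy compute, the distributed approach therefore not only preserves the original capabilities (seen and unseen tasks) in most cases but also outperforms CT0-3B on the target tasks.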

Compositional Instructions

Two instructions can be combined into one, as follows: “Write a summary of the following English text and translate the sentence into Korean.” where “Write a summary of the following English text.” and “Translate the sentence into Korean.” are two separate instructions seen during training.

To test this compositional capability, mT0-3B is used as the MT LM and evaluated on 5 compositional tasks that combine summarization and translation. With the proposed distributed approach, two mT5-3B models are trained on summarization and translation respectively and then merged. The merged model performs better on 4 of the 5 tasks, with a larger gap on low-resource languages such as Korean and Japanese, because learning low-resource languages is hindered by negative task transfer during training. Table 8 shows cherry-picked results.

Conclusion

Expert language models trained on single tasks exhibit strong generalization to unseen tasks, surpassing multi-task language models by a significant margin, showcasing benefits in robustness, adaptability, and compositional instruction performance. The proposed distributed approach encourages exploration of collaborative expert training for potential future advantages in efficiency, privacy, and personalization, not explicitly covered in this paper (see limitations and discussion in Appendix D).

[ICML2023] Large Language Models Struggle to Learn Long-Tail Knowledge

Sun, 17 Dec 2023 11:08:00 +0000

[pdf] [github]

Nikhil Kandpal ^1*, Haikang Deng ^1*, Adam Roberts ², Eric Wallace ³, Colin Raffel ^1*
¹ UNC Chapel Hill ² Google Research ³ UC Berkeley. Correspondence to: Nikhil Kandpal nkandpa2@cs.unc.edu.

Abstract

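(Motivation) LLMs learn a great deal of knowledge from the internet, but some information is common on the web while other information is not.
This paper studies the relationship between the knowledge memorized by an LLM and the information in its web-derived pre-training dataset.
Specifically, it shows that an LM's ability to answer a fact-based question is related to how many documents associated with that question it saw during pre-training.
(Weakness on long-tail knowledge) Today's models are weak on long-tail knowledge, and retrieval augmentation is shown to play a large role in improving this.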

Introduction

LLMs acquire a great deal of information from the internet, from well-known factoids to esoteric domain-specific information. Because models store this implicitly in their parameters, today's LLMs and pre-training datasets are very large. Much of the training data comes from the internet, where information does not appear with equal frequency; long-tail information in particular appears rarely or not at all.

This work investigates the relationship between an LLM's ability to answer a question and how many pre-training documents are relevant to it. For the factoid QA datasets TriviaQA and Natural Questions, the authors focus on the fact that a ground-truth QA pair is tied together by a concrete subject-object pair. For example, for the QA pair (In what city was the poet Dante born? Florence), Dante and Florence are likely to co-occur. To identify such co-occurrences, they apply an entity linking pipeline to trillions of tokens from C4, the Pile, ROOTS, OpenWebText, and Wikipedia.

They show that an LM's ability is related to the number of pre-training documents relevant to the question, the size of the pre-training dataset, and the model size. They also run a counterfactual re-training experiment, training a 4.8B-parameter LM with and without certain documents. Accuracy drops sharply on questions whose relevant documents were removed, which validates the entity linking pipeline and suggests that the observed correlation is likely to be causal.

Finally, they analyze how model scaling and retrieval augmentation can better capture knowledge that rarely appears in pre-training. For model scale, they show a log-linear relationship between parameter count and QA accuracy. This means that reaching competitive QA accuracy on long-tail questions would require a dramatic increase in parameters, on the order of a quadrillion. Retrieval-augmented systems are more promising: they greatly improve LLM performance on long-tail questions.

Identifying Relevant Pre-training Data

Background and Research Question
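LLMs have recently shown very strong performance, but one question remains: it is still unclear what kind of knowledge LMs actually capture. For example, do they simply learn the “easy” facts that appear frequently in the pre-training data? The authors give a few QA pairs as prompt exemplars via in-context learning (ICL), have the model answer questions, and then examine the relationship between the LM's ability and the amount of related information in the pre-training data.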

Approach
The authors first identify the salient entity in the question and the alias set of the ground-truth answer. They then find the pre-training documents in which the salient question entity and the answer entity co-occur. In Figure 2 above, Dante_Alighieri and Florence are extracted as the salient question entity and salient answer entity, and the documents containing both entities are counted.

Their approach builds on T-REx (https://aclanthology.org/L18-1544/), which relies on the observation that when the subject and object of a subject-relation-object triplet co-occur in a text, the triplet itself is likely to be expressed there. In addition, a human study shows that this counting pipeline extracts relevant documents well. Based on these findings, the authors define the documents found via the salient question and answer entities as relevant documents.
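The counting rule above can be sketched in a few lines: a document counts as relevant if it contains both the salient question entity and the salient answer entity. This is a minimal sketch that assumes entity linking has already been run, so each document is represented as a set of entity identifiers; the function name and the toy entity sets are illustrative.

```python
def count_relevant_documents(doc_entities, question_entity, answer_entity):
    """Count documents whose linked entities include both the question and answer entity.

    doc_entities: iterable of sets, one set of linked entity IDs per document.
    """
    return sum(
        1
        for entities in doc_entities
        if question_entity in entities and answer_entity in entities
    )


if __name__ == "__main__":
    # Toy corpus: each set stands for the entities linked in one document.
    corpus = [
        {"Dante_Alighieri", "Florence"},
        {"Dante_Alighieri"},
        {"Florence", "Italy"},
        {"Dante_Alighieri", "Florence", "Divine_Comedy"},
    ]
    print(count_relevant_documents(corpus, "Dante_Alighieri", "Florence"))
```

At the scale of a real pre-training corpus this would be done with an inverted index from entity to document IDs rather than a linear scan.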

Applying this method requires entity linking both the massive pre-training corpora and the downstream QA datasets.

1. Entity Linking Pre-training Data

The authors use the DBpedia Spotlight entity linker. The pre-training data are (1) The Pile, (2) ROOTS, (3) C4, (4) OpenWebText, and (5) Wikipedia. This step took about 3 weeks on a 128-CPU-core machine and produced 2.1 TB of entity linking data.

2. Finding Entity Pairs in QA Data

The authors entity link two open-domain QA datasets, TriviaQA and Natural Questions. Except for a few examples used as few-shot prompt exemplars, both the training and validation data are used. The DBpedia entity linker is first run on each example. Because a question can have several valid answers, the question is concatenated with all of its valid answers, which makes entity linking more accurate. The most common entity found in the ground-truth answer set is then used as the salient answer entity. Next, the authors iterate over all entities found in the question and select the one that co-occurs most often with the salient answer entity in the pre-training dataset. Examples where no entity is found in the question, the answer, or both are dropped. Examples that end up with zero relevant documents are also dropped, since this is likely due to an entity linking error.

3. Human Evaluation of Document Counting Pipeline

In the human evaluation, around 60% of the documents counted for TriviaQA were judged relevant. The pipeline is imperfect because the entity linker can be wrong, and because a document containing both the salient question entity and the salient answer entity is not necessarily relevant. Later experiments nonetheless confirm that the pipeline works well enough.

LM Accuracy Depends on Relevant Document Count

First, the authors measure the relationship between an LLM's answering ability and the number of relevant documents in its pre-training corpus. They experiment with the GPT-Neo, BLOOM, and GPT-3 families. Relevant documents are counted on the Pile for GPT-Neo and on ROOTS for BLOOM; GPT-3's training data is not public, so it is simulated with OpenWebText. Models decode greedily until they generate a newline (\n), and Exact Match (EM) is used as the metric.
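The evaluation rule above (keep the generation up to the first newline, then score by exact match) can be sketched as follows. The normalization step (lowercasing, stripping punctuation and English articles) is the usual SQuAD-style convention and is an assumption here; the source only states that EM is used. Function names are illustrative.

```python
import re
import string


def first_line(generation):
    """Keep only the text generated before the first newline."""
    return generation.split("\n", 1)[0].strip()


def normalize(text):
    """Assumed SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(generation, answers):
    """True if the normalized first line equals any normalized reference answer."""
    prediction = normalize(first_line(generation))
    return any(prediction == normalize(answer) for answer in answers)


if __name__ == "__main__":
    print(exact_match("Florence\nQ: Who wrote Hamlet?", ["Florence", "Firenze"]))
    print(exact_match("The city of Florence", ["Florence"]))
```

Cutting at the first newline matters in the few-shot setting, because the model otherwise tends to continue with another question-answer pair.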

The BLOOM results are in Figure 1 (top) and the GPT-Neo results in Figure 3. Both show the same trend. The GPT-Neo results on NQ are in Figure 4 and again show the same trend.

Simpler Methods for Identifying Relevant Documents Are Less Effective

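The current method defines a relevant document as one where the salient Q entity and salient A entity co-occur. Here the authors compare it with counting documents where only the salient Q entity, or only the salient A entity, appears. In the left panel, accuracy seems to increase with the Q-only and A-only counts as well. However, when the analysis is restricted to questions whose Q and A entities co-occur in fewer than 5 documents, using only the Q entity or only the A entity as the relevance criterion no longer shows accuracy improving.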

Humans Show Different Trends Than LMs

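One explanation for these results could be that questions with few relevant documents are simply “harder”, which lowers model performance. The authors show this is not the case by measuring human accuracy on Natural Questions. Using questions labeled by 5 different human annotators, they apply a “leave-one-annotator-out” metric in which one annotator is held out and the other four serve as the ground-truth answer set. Figure 7 plots human accuracy against relevant document count: human accuracy is actually highest on questions with few relevant documents, the opposite of the LM trend. The authors hypothesize that humans do better on questions with few relevant documents because (1) questions about rarer facts tend to be simpler than questions about common entities, and (2) the Wikipedia articles given to annotators are shorter for rare entities, which makes reading comprehension easier and increases inter-annotator agreement.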

Causal Analysis via Re-training

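The results so far are correlational. There may be unknown variables that explain them; in other words, rarer questions may be harder for LMs for other reasons. Here the authors try to establish causality by removing certain documents from the training data and re-training the LM.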

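They first train a 4.8B LM on C4, following the setup of prior work (Wang et al. 2022). They then measure the effect of deleting certain documents from the training dataset. For each log-scale bin of relevant document count (e.g., $10^0$ to $10^1$ relevant documents, $10^1$ to $10^2$, …), they sample 100 questions from TriviaQA and remove all relevant documents for those questions from C4. This removes about 30% of C4. Finally, they train a “counterfactual” LM on this modified pre-training dataset and compare it with the baseline model. Both models are trained for one epoch, so the counterfactual model is trained for 30% fewer steps, which makes it slightly worse overall. To account for this, only performance on questions whose relevant documents were removed is considered.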

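The results are shown in the figure above. For questions with few relevant documents in the original C4, both the baseline and the counterfactual LM perform poorly, so the gap between them is small. For questions with many relevant documents, however, the counterfactual LM is much worse. This suggests a causal link between relevant document count and QA performance.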

Methods to Improve Rare Fact Learning

So far we have seen that LLMs depend strongly on relevant document count. This section studies ways to reduce that dependence: scaling up the data, scaling up the model, and adding a retrieval-augmented system.

Can We Scale Up Datasets?

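Recent LLMs are trained on hundreds of billions of tokens. Because the results show that the number of relevant documents has to grow on a log scale, increasing the pre-training dataset by around 5x would not help much. Then could we increase the diversity of the pre-training data instead? Surprisingly, the table above shows that although all the pre-training datasets were crawled independently, their relevant document counts for TriviaQA are highly correlated, so they can hardly be called very “different” pre-training datasets.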

Can We Scale Up Models?

For long-tail questions (questions with few relevant documents), model size would have to grow exponentially to guarantee an improvement. In the figure above, for long-tail NQ questions with fewer than 100 relevant docs, a BLOOM model would need as many as $10^{18}$ parameters to approach the accuracy of a strong supervised model or of humans. (This is an extrapolation from a log-linear fit with $R^2 = 0.98$, not an actually scaled-up model.)

Can We Use Retrieval Augmentation?

Oracle Retrieval

As the figure above shows, when the gold paragraph from Wikipedia is provided as an oracle, GPT no longer struggles even on questions with few relevant documents. Similar to the human results above, accuracy decreases as the relevant document count of a question grows (because rare questions are easier on average when relevant context is provided).

BM25 Retrieval

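Next, following a common retrieval-augmented baseline, a BM25 retriever (Robertson & Zaragoza, 2009) is used to select paragraphs from Wikipedia. The top 3 highest-scoring paragraphs are added for both the in-context training examples and the test question. For each in-context training example, the authors ensure that at least one retrieved paragraph contains the answer, so that the LM learns to make use of the documents.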

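The top-k recall of the BM25 retriever is first evaluated as a function of relevant document count; the results are in Figure 8. BM25 achieves fairly high recall, especially for larger values of k. However, the BM25 retriever still shows a weak dependence on relevant document count.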

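Next, the accuracy of BM25-augmented GPT-Neo models is evaluated on Natural Questions; the results are in Figure 9. Overall, the retrieval-augmented models outperform the closed-book models across all ranges of relevant document count, and especially on rare examples. These results suggest that retrieval augmentation is a promising way to improve performance on questions with few relevant documents in the pre-training dataset.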

Conclusion and Future Work

Large language models, trained on extensive internet text, exhibit notable few-shot learning capabilities. The release of LLMs and their pre-training datasets to the open source allows researchers to explore the origins of these capabilities. Our study, one of the first to connect LLM behavior to pre-training data, reveals that while LLMs perform moderately on open-domain QA benchmarks, their success is mainly limited to questions reflecting widely available pre-training knowledge. This prompts further investigation into enhancing long-tail knowledge retention, beyond simple scaling of model and dataset size. We are particularly enthusiastic about refining retrieval-augmented LMs, emphasizing efficiency and accuracy. Additionally, our focus on factoid question answering knowledge learning raises questions about similar relationships in other tasks. While our analysis centers on memorization's impact on question answering, it may extend to tasks involving (or avoiding) memorized knowledge, such as analyzing private text, common-sense reasoning, or predicting source code. Lastly, we encourage ongoing few-shot learning evaluations to uncover model behavior by tracing accuracy back to pre-training data properties, providing insights into existing models' strengths, weaknesses, and potential avenues for improvement.

[ICML2023] A Watermark for Large Language Models

Sun, 17 Dec 2023 05:05:00 +0000

[pdf] [github]

John Kirchenbauer ^1*, Jonas Geiping ^1*, Yuxin Wen ¹, Jonathan Katz ¹, Ian Miers ¹, Tom Goldstein ¹
¹ University of Maryland. Correspondence to: John Kirchenbauer jkirchen@umd.edu.

Abstract

(Motivation) The potential harms of LLMs can be mitigated by watermarking model output: embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens.
(Watermark) The watermark can be embedded with negligible impact on text quality and can be detected with an efficient open-source algorithm without access to the language model API or parameters.
(Method) Before a word token is generated, a randomized set of green tokens is selected, and the use of green tokens is promoted during sampling.
(Experiment) The sensitivity of the watermark is analyzed with an interpretable p-value test and an information-theoretic framework, and the watermark is tested on the multi-billion-parameter OPT (Open Pretrained Transformer) family.

Introduction

As the capabilities of large language models (LLMs) have grown explosively, misuse such as political manipulation through generated fake news and fake web content, and cheating in academic writing with AI systems, is becoming a social problem. Moreover, synthetic web data generated by LLMs is proliferating, and because its quality is much lower than that of human-annotated data, it needs to be detected and removed before model training. For these reasons, detection and auditing of machine-generated text are essential.

This paper proposes a watermarking method. A watermark is a hidden pattern that is hard for humans to notice but makes the text identifiable as synthetic. The authors present an efficient watermark that lets synthetic text be detected from a short span of roughly 25 tokens, while a false positive (flagging human-written text as machine-generated) is statistically improbable.

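The main properties of the proposed watermark are as follows. (1) It can be detected algorithmically without access to the language model API or any knowledge of the model parameters. (2) It can be applied without re-training (e.g., fine-tuning). (3) It is detectable from only a portion of the generated text, so it can still be detected when only part of a larger document was generated. (4) It cannot be removed without modifying a significant fraction of the generated tokens. (5) A rigorous statistical measure of confidence that the watermark has been detected can be computed.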

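The figure above shows how the watermark is detected. For the watermarked text at the bottom, 9 green tokens would be expected if a human had written it, but 28 green tokens are detected. The probability of this happening by chance is 6e-14, so we can be highly confident that the text is machine-generated.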

A caveat: The difficulty of watermarking low-entropy sequences
Consider the following two sequences.
The quick brown fox jumps over the lazy dog.

for(i= 0;i<n;i++) sum+=array[i]

For these two sequences it is hard to tell whether a human or a machine wrote them. They have low entropy: the first few tokens strongly determine the tokens that follow. Forcing a watermark into such low-entropy text would raise perplexity sharply and degrade quality.

A simple proof of concept

The first approach is a simple “hard” red list watermark. It is easy to analyze, easy to detect, and hard to remove. However, this simplicity comes at the cost of poor generation quality on low-entropy sequences.

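This method is shown in the algorithm above. For the $t$-th token $s^{(t)}$, it generates a pseudo-random red list, a set of tokens that $s^{(t)}$ is not allowed to be. Because the red list is seeded by the previous token $s^{(t-1)}$, it can be reproduced without access to the entire sequence.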
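The hard red list rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the SHA-256 based seeding, and the optional `key` argument are assumptions made for the example, and the vocabulary is split exactly in half.

```python
import hashlib
import random


def red_list(prev_token: int, vocab_size: int, key: int = 0) -> set:
    """Pseudo-random red list (half of the vocabulary), seeded by the previous token."""
    # Hypothetical seeding scheme: hash the (key, previous token) pair.
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: vocab_size // 2])


def sample_hard(probs: list, prev_token: int, rng: random.Random) -> int:
    """Sample the next token from `probs`, never emitting a red-list token."""
    red = red_list(prev_token, len(probs))
    green_ids = [i for i in range(len(probs)) if i not in red]
    weights = [probs[i] for i in green_ids]
    if sum(weights) == 0:
        # All probability mass sits on red tokens: fall back to uniform over green.
        weights = [1.0] * len(green_ids)
    return rng.choices(green_ids, weights=weights, k=1)[0]
```

Because `red_list` depends only on the previous token (and the key), a detector can recompute it for every position without calling the language model.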

Detecting the watermark.

Generating watermarked text requires access to the language model, but detecting the watermark does not. A third party who knows the hash function and the random number generator can re-create the red list for each token and count how often the red list rule is violated.

In other words, the watermark can be detected by testing the null hypothesis $H_0$: the text sequence was generated with no knowledge of the red list rule.

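Because the red list is chosen at random each time, a natural writer is expected to violate the red list rule on about half of their tokens, whereas the watermarked model produces no violations. The probability that a natural source produces $T$ tokens without violating the red list rule is only $1/2^T$, which is vanishingly small even for short text fragments of a few words.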

A more robust way to test the null hypothesis is a one-proportion z-test. If the null hypothesis is true, the number of green list tokens $s_G$ has expected value $T/2$ and variance $T/4$, so the z-statistic is $z = 2(s_G - T/2)/\sqrt{T}$.

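If $z$ exceeds a chosen threshold, the null hypothesis is rejected and the watermark is declared present. Suppose the null hypothesis is rejected when $z > 4$. The false positive probability is then $3 \times 10^{-5}$, the one-sided p-value corresponding to $z > 4$. At the same time, any watermarked sequence with $T \geq 16$ tokens is detected, since 16 is the smallest $T$ that gives $z = 4$ when $s_G = T$.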
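As a quick check of the numbers above, here is a small sketch of the one-proportion z-test for the hard watermark. The function names are illustrative, and the one-sided p-value uses the standard normal tail via `math.erfc`.

```python
import math


def hard_z_score(green_count: int, total: int) -> float:
    """z = 2 (s_G - T/2) / sqrt(T), i.e. mean T/2 and variance T/4 under H0."""
    return 2 * (green_count - total / 2) / math.sqrt(total)


def one_sided_p_value(z: float) -> float:
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# A fully green sequence of 16 tokens sits exactly at the z = 4 threshold.
z16 = hard_z_score(16, 16)
p16 = one_sided_p_value(z16)
```

With `s_G = T` the statistic reduces to `sqrt(T)`, which is why 16 tokens are enough to reach `z = 4`.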

How hard is it to remove the watermark?
The one-proportion $z$-test makes the watermark hard to remove. Consider a watermarked sequence of length 1000, and suppose an adversary modifies 200 tokens in order to add red list words and scrub the watermark. A modified token at position $t$ can violate the red list rule at position $t$. In addition, $s^{(t)}$ determines the red list for $s^{(t+1)}$, so a maximally adversarial choice of $s^{(t)}$ can also put $s^{(t+1)}$ in violation of the rule. The 200 token flips can therefore create at most 400 red list violations.

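Unfortunately for the attacker, even this maximally adversarial sequence still has 600 green list tokens, giving a z-statistic of $2(600 - 1000/2)/\sqrt{1000} \approx 6.3$ and a p-value of about $10^{-10}$. The watermark remains easy to detect with extremely high confidence. In general, removing the watermark from a long sequence requires modifying roughly one quarter of the tokens or more.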

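The analysis above assumes that the attacker has complete knowledge of the watermark and that every selected token is maximally adversarial (which is likely to hurt quality). In practice, an attacker who does not know the watermark algorithm has only a 50% chance of each flipped token landing on the red list, and the same holds for the adjacent token. In that case, modifying 200 tokens creates only 200 red list words in expectation.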

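To summarize, roughly 50% of the tokens in human-written text are expected to fall on the red list. Even if an attacker knows the red list, flipping about 20% of the tokens is not enough to remove the watermark. Roughly 25% (1/4) of the tokens must be flipped before the test fails to reject the null hypothesis and the text can pass as human-written. This shows how hard it is to remove a watermark (a run of green tokens) produced by the “hard red list” rule.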
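The attack arithmetic can be replayed with a short sketch. It assumes the hard watermark (all tokens green before the attack), the $z > 4$ threshold used above, and a fixed number of red list violations per flipped token (2 for the fully informed adversary, 1 in expectation for an uninformed one). The function names are made up for this example.

```python
import math


def z_after_attack(total: int, flipped: int, violations_per_flip: float) -> float:
    """z-statistic of an initially all-green sequence after `flipped` token edits."""
    green = total - violations_per_flip * flipped
    return 2 * (green - total / 2) / math.sqrt(total)


def min_flips_to_evade(total: int, violations_per_flip: float, threshold: float = 4.0):
    """Smallest number of flips that pushes z below the detection threshold."""
    for k in range(total + 1):
        if z_after_attack(total, k, violations_per_flip) < threshold:
            return k
    return None


z_200 = z_after_attack(1000, 200, 2)      # about 6.3, still clearly detected
worst_case = min_flips_to_evade(1000, 2)  # fully informed adversary
average_case = min_flips_to_evade(1000, 1)  # uninformed attacker, in expectation
```

Under these assumptions the informed adversary needs a bit over a fifth of the 1000 tokens, in line with the "roughly one quarter" rule of thumb, while the uninformed attacker needs about twice as many edits.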

Drawbacks of the hard red list rule.

However, the hard red list rule handles low-entropy sequences in too simplistic a way: it prevents the language model from producing them at all. For example, in many text datasets the token “Barack” is followed almost deterministically by “Obama”, yet “Obama” may be on the red list.

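A better behavior is therefore a “soft” watermarking rule that is only active on high-entropy text, where the watermark can be embedded imperceptibly. As long as low-entropy sequences are wrapped inside a passage with enough total entropy, the passage will still easily trigger the watermark detector, which solves the problem described in Section 1.2. Furthermore, the watermark can be combined with a beam search decoder that “irons in” the watermark: by searching the hypothesis space of likely token sequences, it finds candidate sequences with a high density of green list tokens, producing a strong watermark at minimal perplexity cost.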

A more sophisticated watermark

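Now consider the “soft” watermark. In short, unlike the hard red list rule, under which a red list token is never selected, a token on the red list still has some probability of being sampled. This promotes the use of the green list for high-entropy tokens, where there are many good choices, while having almost no effect on the choice of nearly deterministic low-entropy tokens.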

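An LLM turns the logits $l^{(t)}$ of its last layer into a probability vector $p^{(t)}$ over the vocabulary through a softmax, $p^{(t)}_k = \exp(l^{(t)}_k) / \sum_i \exp(l^{(t)}_i)$. The “soft” watermark adds a hardness parameter $\delta$ to the logits of green list tokens, and it replaces the fixed one-half split with a green list size $\gamma$.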

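The “soft” red list rule enforces the watermark where doing so has little impact on quality, and almost ignores the watermark rule when entropy is low. A highly likely word with $p^{(t)}_k \approx 1$ has a much larger logit than every other candidate, so it remains the largest even if it is on the red list. When entropy is high, there are many comparable logits to choose from, and the $\delta$ rule has a large effect on the sampling distribution, biasing the output strongly toward the green list.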
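The behavior described above can be illustrated with a minimal sketch: add $\delta$ to the logits of green list tokens, then apply a softmax. The function name and the toy logits are assumptions for the example, and the green list is passed in directly rather than derived from a hash.

```python
import math


def soft_watermark_probs(logits: list, green_ids: set, delta: float) -> list:
    """Softmax over logits after adding `delta` to every green-list logit."""
    boosted = [l + delta if i in green_ids else l for i, l in enumerate(logits)]
    m = max(boosted)  # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in boosted]
    total = sum(exps)
    return [e / total for e in exps]


green = {1, 2}

# Low entropy: token 0 dominates and stays dominant although it is on the red list.
p_low = soft_watermark_probs([10.0, 0.0, 0.0, 0.0], green, delta=2.0)

# High entropy: all logits equal, so the bias shifts most of the mass to green tokens.
p_high = soft_watermark_probs([0.0, 0.0, 0.0, 0.0], green, delta=2.0)
green_mass_high = p_high[1] + p_high[2]
```

In the low-entropy case the red token keeps more than 99% of the probability, while in the high-entropy case the green list ends up with roughly 88% of the mass.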

Detecting the soft watermark.
Detecting the “soft” watermark works the same way as detecting the “hard” watermark.

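For arbitrary $\gamma$, the statistic is $z = (s_G - \gamma T)/\sqrt{T\gamma(1-\gamma)}$. Considering $z > 4$ again, the false positive rate is still $3 \times 10^{-5}$. With the “hard” watermark, any watermarked sequence of 16 or more tokens could be detected regardless of the properties of the text. With the “soft” watermark, the ability to detect watermarked text depends on the entropy of the sequence: high-entropy sequences are detected from relatively few tokens, while low-entropy sequences need more tokens.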
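A small sketch of the generalized statistic follows. The function names are illustrative. The first example assumes that the figure in the introduction corresponds to $\gamma = 0.25$ and 36 tokens, which is what an expected count of 9 green tokens would imply.

```python
import math


def z_score(green_count: float, total: int, gamma: float) -> float:
    """z = (s_G - gamma * T) / sqrt(T * gamma * (1 - gamma))."""
    return (green_count - gamma * total) / math.sqrt(total * gamma * (1 - gamma))


def green_count_for_threshold(z: float, total: int, gamma: float) -> float:
    """Number of green tokens at which the statistic reaches `z`."""
    return gamma * total + z * math.sqrt(total * gamma * (1 - gamma))


z_figure = z_score(28, 36, 0.25)                  # 28 green tokens where 9 are expected
needed = green_count_for_threshold(4, 200, 0.5)   # green tokens needed at T = 200
```

With $\gamma = 0.5$ the formula reduces to the hard-watermark statistic, so `z_score(16, 16, 0.5)` is again 4.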

Analysis of the soft watermark

This section analyzes the “soft” watermark more closely. Unlike the actual sampling method, the analysis assumes that the red list is sampled uniformly at random. In practice it is sampled by a random number generator seeded with the previous token.

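For the analysis, a new notion of entropy called “spike entropy” is defined. For a discrete probability vector $p$ and a scalar $z$, the spike entropy of $p$ with modulus $z$ is $S(p, z) = \sum_k p_k / (1 + z p_k)$.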

Like the conventional Shannon entropy, it takes its smallest value, $1/(1+z)$, when all the mass of $p$ is concentrated at a single location, and its largest value, $N/(N+z)$, when the mass is uniformly distributed over $N$ entries. For large $z$, an individual term $p_k/(1+zp_k)$ is close to $1/z$ when $p_k > 1/z$ and close to 0 when $p_k < 1/z$. Spike entropy can therefore be interpreted as a softened measure of the number of entries of $p$ with probability greater than $1/z$.
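The two extreme cases are easy to check numerically. The sketch below assumes the definition $S(p, z) = \sum_k p_k / (1 + z p_k)$; the function name and the toy distributions are illustrative.

```python
def spike_entropy(p: list, z: float) -> float:
    """Spike entropy of a discrete distribution p with modulus z."""
    return sum(pk / (1 + z * pk) for pk in p)


N, z = 4, 3.0
one_hot = [1.0, 0.0, 0.0, 0.0]   # all mass at a single location
uniform = [1.0 / N] * N          # maximum-entropy case

s_min = spike_entropy(one_hot, z)   # equals 1 / (1 + z)
s_max = spike_entropy(uniform, z)   # equals N / (N + z)
```

Any other distribution over these four entries gives a value between `s_min` and `s_max`.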

Using this, the theorem that predicts the number of green list tokens in watermarked text is as follows.
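The statement itself did not survive in this post. As a reconstruction from memory of Theorem 4.2 in the paper, which should be checked against the original, the bound has the following form. Here $\alpha = \exp(\delta)$, $|s|_G$ is the number of green list tokens in a sequence of $T$ tokens, and $S^\star$ is a lower bound on the average spike entropy, taken with modulus $(1-\gamma)(\alpha-1)/(1+(\alpha-1)\gamma)$.

$$\mathbb{E}\,|s|_G \;\geq\; \frac{\gamma \alpha T}{1 + (\alpha - 1)\gamma}\, S^\star$$

$$\operatorname{Var}\,|s|_G \;\leq\; T \cdot \frac{\gamma \alpha S^\star}{1 + (\alpha - 1)\gamma} \left(1 - \frac{\gamma \alpha S^\star}{1 + (\alpha - 1)\gamma}\right)$$

With $\gamma = 0.5$, $\delta = 2$, $T = 200$ and $S^\star = 0.807$, these expressions give about 142.2 expected green tokens and a standard deviation of about 6.41, which agrees with the numbers quoted in the sensitivity analysis.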

Sensitivity of the watermark test
The sensitivity of the “soft” watermark can be computed with a standard type-II error analysis. For illustration, the authors estimate the type-II (false negative) error rate of a soft watermark with $\gamma$ = 0.5 and $\delta$ = 2. They assume that 200 tokens are generated with OPT-1.3B from prompts taken from the RealNewsLike subset of the C4 dataset. They also assume a detection threshold of $z$ = 4 (which occurs at about 128.2/200 green tokens), which gives a type-I (false positive) error rate of $3 \times 10^{-5}$.

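Theoretical bound. Across about 500 generations, the average spike entropy per sample is S = 0.807. By Theorem 4.2, the expected number of green list tokens per generation is at least 142.2. The empirical average is in fact 159.5. For a sequence with entropy equal to the mean (S = 0.807), the standard deviation of the number of green list tokens is at most 6.41 tokens, and a standard Gaussian approximation gives a sensitivity of 98.6% (a type-II error rate of 1.4%). This is a lower bound on the sensitivity for this particular entropy. Using the true empirical mean of 159.5 instead of the theoretical bound gives a type-II error rate of $5.3 \times 10^{-7}$. This is a realistic approximation but not a rigorous lower bound.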
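The figures in this paragraph can be reproduced approximately with a short calculation. The sketch assumes that the bound of Theorem 4.2 has the form $\gamma\alpha S / (1 + (\alpha-1)\gamma)$ per token with $\alpha = e^{\delta}$, and it uses a plain Gaussian approximation; all names are illustrative.

```python
import math


def green_fraction_bound(gamma: float, delta: float, spike_entropy: float) -> float:
    """Assumed per-token lower bound on the green fraction: gamma*alpha*S / (1 + (alpha-1)*gamma)."""
    alpha = math.exp(delta)
    return gamma * alpha * spike_entropy / (1 + (alpha - 1) * gamma)


def normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * math.erfc(-x / math.sqrt(2))


T, gamma, delta, S = 200, 0.5, 2.0, 0.807
q = green_fraction_bound(gamma, delta, S)
mean_lower_bound = T * q                      # about 142.2 green tokens
std_upper_bound = math.sqrt(T * q * (1 - q))  # about 6.41 tokens

# Green-token count at which z = 4 for T = 200 and gamma = 0.5.
threshold = gamma * T + 4 * math.sqrt(T * gamma * (1 - gamma))
sensitivity = normal_cdf((mean_lower_bound - threshold) / std_upper_bound)
```

The resulting sensitivity lands between 98% and 99%, close to the 98.6% quoted above; the last digit depends on how the threshold and intermediate values are rounded.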

Empirical sensitivity. Empirically, 98.4% of generations are detected at the $z$ = 4 (128 token) threshold when multinomial sampling is used. With 4-way beam search over a greedy decoding, the empirical sensitivity is 99.6%. Unlike the theoretical bound, these rates are computed over all generations, which have the same length but different individual entropies. The main source of type-II errors here is low-entropy sequences, since the calculation above predicts a very low error rate when the entropy is near the mean. To verify this, the authors examine the subset of 375/500 generations whose spike entropy exceeds the 25th percentile, and 100% of these are detected at the $z$ = 4 threshold.

What do failure cases look like? Low-entropy (undetectable) sequences typically involve data memorization: the model regurgitates a copy (or near-copy) of human-written text, which is therefore not detected as machine-written.

Impact on quality of generated text
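When the distribution produced by the language model is uniform (maximal entropy), the randomness of the green list means tokens are still sampled uniformly, and perplexity is unaffected. Conversely, in the minimal-entropy case all probability mass is concentrated on a single token, so the soft watermark has no effect and again perplexity is unaffected. The watermark rule does, however, affect perplexity for tokens of moderate entropy.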

Private Watermarking

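The watermark algorithm above is designed to be public. A watermark can also be operated in private mode, in which the algorithm uses a random key that is kept secret and is hosted behind a secure API. If the attacker has no knowledge of the key used to produce the red list, removing the watermark becomes harder. However, testing for the presence of the watermark now also requires the same secure API, and if that API is public, access must be monitored so that an attacker cannot make too many queries with minor variations of the same sequence.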

Experiments

The behavior of the watermark is examined with the OPT-1.3B model (Zhang et al., 2022). Watermark strength is measured by the rates of type-I errors (human text flagged as watermarked) and type-II errors (watermarked text not detected).

Datasets and Prompts.
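To simulate a variety of realistic language modeling scenarios, texts are sampled at random from the news-like subset of the C4 dataset and processed as follows. For each random string, a fixed length of tokens is trimmed from the end and treated as the “baseline” completion. The remaining tokens are used as the prompt. A larger oracle language model (OPT-2.7B) is used to compute the perplexity (PPL) of the generated completions and of the human baseline.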

Watermark Strength vs Text Quality

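To obtain a very strong watermark on short sequences, one should choose a small green list size $\gamma$ and a large green list bias $\delta$. However, a stronger watermark can distort the generated text. Figure 2 (Left) above shows the trade-off between watermark strength ($z$-score) and text quality (perplexity) for various combinations of watermarking parameters. For each parameter choice, the results are computed on 500 ± 10 sequences of length T = 200 ± 5 tokens. Interestingly, a small green list size of $\gamma$ = 0.1 turns out to be Pareto-optimal.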

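In addition to these quantitative results, Table 1 above shows examples of real prompts and watermarked outputs, to give a qualitative sense of how the test statistic and the quality measure behave on different kinds of prompts.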

Ironing in the Watermark with Beam Search.
Figure 2 (Right) shows the trade-off between watermark strength and accuracy when beam search is used. Beam search has a synergistic interaction with the soft watermark rule. In particular, with 8 beams the points in the figure form an almost vertical line, showing that strong watermarking is achieved at almost no perplexity cost.

Watermark Strength vs Number of Tokens.

Theory predicts that the type-I and type-II error rates of the watermark should decrease as the sequence length T increases. Figure 3 above shows the strength of the watermark, measured by the average $z$-score, as T ranges from 2 to 200. Curves are shown for various values of $\delta$ and $\gamma$. The left two charts use multinomial sampling, and the right chart uses 8-way beam search with $\gamma$ = 0.25. Once again, 8-way beam search proves very effective at reaching a high green list rate: even with a moderate bias of $\delta$ = 2, an average $z$-score above 5 is reached at 35 tokens.

Performance and Sensitivity for Multinomial Sampling.

To show the sensitivity of the resulting hypothesis test based on the observed $z$-scores, Table 2 reports error rates for various watermarking parameters. The ROC charts in Figure 4 additionally sweep over a range of thresholds.

Attacking the watermark

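Care must be taken to keep the watermark and the watermark detector secure when implementing them. Otherwise, an adversarial user can modify the text to add red list tokens and avoid detection. In many cases, simple attacks can be prevented by properly normalizing the text before the hashes are computed. The next section implements and evaluates an example of a representative attack that uses a second, smaller language model.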

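Degradation Under Attack: Span Replacement Using a LM
The authors study a realistic black-box attack that tries to remove the watermark by replacing spans of the original output text with another language model. The watermark algorithm is treated as private, as if it were hidden behind an API. The attacker has no access to the locations of green list tokens and instead replaces tokens at random indices until a word replacement budget ε is reached. The budget constraint maintains a level of semantic similarity between the original watermarked text and the attacked text; without it, the utility of the original text for its intended task may be lost. Each span replacement in the attack is performed by inference with a multi-million-parameter language model. This model is roughly a third of the size of the target model, reflecting the expectation that a practical attack should keep a baseline level of efficiency in model calls. In the experiments, T5-Large is used as the replacement model, and tokens are iteratively selected and replaced until the attacker reaches the budget or no more suitable replacement candidates are returned.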

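The watermarked text is tokenized with the T5 tokenizer. The following steps are then repeated until εT successful replacements have been performed or a maximum iteration count is reached.

(1) One word of the tokenization is randomly replaced with a `<mask>` token.

(2) The region of text surrounding the mask token is passed to T5, and a 50-way beam search returns a list of k = 20 candidate replacement token sequences, each with a score corresponding to its likelihood.

(3) Each candidate is decoded into a string. If one of the k candidates returned by the model is not equal to the original string corresponding to the masked span, the attack succeeds and the span is replaced with the new text.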

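After attacking a set of 500 sequences of length T = 200 ± 5 tokens in this way, updated $z$-scores are computed, and the resulting error rates are summarized in the ROC plot of Figure 5. The attack is effective at increasing the number of red list tokens in the text, but as the figure shows, only a small decrease in watermark strength is measured when ε = 0.1. With a larger budget of ε = 0.3 the watermark removal is more successful, but the average perplexity of the attacked sequences increases threefold and more model calls are required.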

Conclusion

The proposed method's z-statistic for detection relies solely on the green list size parameter γ and the hash function, independent of δ or other factors related to green list enforcement, allowing flexible deployment of watermarks with context-specific rules and the ability to change the sampling algorithm without altering the detector; however, open questions persist, such as optimal testing in streaming or mixed-context scenarios, leaving room for future research on the practicality of watermarks in countering malicious uses of generative models.

[ACL2023] Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

Mon, 11 Sep 2023 15:10:00 +0000

[pdf] [github]

Harsh Trivedi ¹, Niranjan Balasubramanian ¹, Tushar Khot ², Ashish Sabharwal ²
¹ Stony Brook University, Stony Brook, U.S.A. ² Allen Institute for AI, Seattle, U.S.A.

Abstract

(LLM and Weakness) Recent LLMs are very strong at natural language reasoning, including chain-of-thought (CoT) reasoning for multi-step QA. However, they struggle when the necessary knowledge is unavailable or not up to date within their parameters.
(One-step retrieval and Weakness) One-step retrieve-and-read approaches, which retrieve relevant text from an external knowledge source, have therefore been studied recently, but they are insufficient for multi-step QA.
(IRCoT) Observing that what to retrieve depends on what has already been derived, the authors propose IRCoT, which interleaves retrieval with the steps of a CoT.
(Experiment) Applied to GPT-3, IRCoT substantially improves retrieval and also yields large gains on four downstream QA datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. It also performs well in out-of-domain (OOD) settings and when applied to smaller models.

Introduction

Recent large language models (LLMs) can answer complex questions by reasoning step by step in natural language, known as chain-of-thought (CoT) reasoning (https://openreview.net/pdf?id=_VjQlMeSB_J). This approach works only when all the information needed to answer the question is present in the model parameters. For many open-domain questions, however, most of the required knowledge is not available in the parameters. ([1], [2])
How can we augment chain-of-thought prompting for open-domain, knowledge-intensive tasks that require complex, multi-step reasoning?

Augmenting LMs with one-step retrieval lets them use relevant knowledge and has solved many factoid tasks ([3], [4]), but such methods have clear limitations on complex multi-step reasoning questions. For such questions, one often has to retrieve partial knowledge, perform partial reasoning, retrieve additional information based on the outcome of that partial reasoning, and iterate. For example, for the question “In what country was Lost Gravity manufactured?” in Figure 1 above, a single retrieval step gets as far as the company Mack Rides but cannot recover the country.

Retrieval and reasoning steps must therefore go hand in hand. Without retrieval, the model is likely to produce an incorrect reasoning step, that is, to hallucinate. Likewise, without the first reasoning step, the text needed for the second step cannot be identified. In other words, retrieved facts are needed to produce correct reasoning steps, and reasoning steps are needed to retrieve relevant facts.

Based on this intuition, the authors propose Interleaving Retrieval with CoT (IRCoT). Figure 1 gives an overview. First, a base set of paragraphs is retrieved with the question as the query. Then two steps alternate: (i) extend CoT: generate the next CoT sentence from the question, the paragraphs collected so far, and the CoT sentences generated so far; (ii) expand retrieved information: use the last CoT sentence as a query to retrieve additional paragraphs and add them to the collected set. These steps are repeated until the CoT reports an answer or the maximum allowed number of reasoning steps is reached. On termination, the collected paragraphs are returned as the retrieval outcome, and all of them are used as context to answer the question via QA prompting (GPT-3) or CoT prompting (Zero-shot CoT).

With code-davinci-002, IRCoT brings large improvements on four multi-step reasoning datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. Similar gains are observed for smaller models such as Flan-T5 at 11B, 3B, and 700M parameters. In particular, Flan-T5-XL (3B) with IRCoT outperforms the 58 times larger GPT-3 with one-step retrieval. These improvements also hold in the out-of-distribution (OOD) setting. Finally, the QA scores are well above those of recent few-shot open-domain QA (ODQA) approaches (DecomP, Self-ask, ReAct).

Chain-of-Thought-Guided Retrieval and Open-Domain QA

The goal is to answer a knowledge-intensive multi-step reasoning question Q in a few-shot setting.
To this end, the retrieve-and-read paradigm is used: a retriever first retrieves documents from a knowledge source, and a QA model then generates the answer. IRCoT mainly contributes to the retrieve step, and the read step uses a standard prompting strategy.

Interleaving Retrieval with Chain-of-Thought Reasoning

IRCoT has three ingredients: (i) a base retriever that takes a query and returns paragraphs from a knowledge source, (ii) an LLM capable of zero/few-shot CoT generation, and (iii) a small number of questions annotated with CoTs, that is, reasoning steps that lead to the answer. First, as in the figure above, the base retriever retrieves K paragraphs with the question Q as the query. Then two steps, reason and retrieve, are interleaved iteratively until the termination criterion is met.

The retrieval-guided reasoning step (“REASON”) generates the next CoT sentence from the question, the paragraphs collected so far, and the CoT sentences generated so far. The prompt is as follows.

For in-context learning (ICL), the full prompt above is used as a demonstration, and at test (inference) time the model is asked to continue the CoT. The reason step may generate several sentences, but only the first one is kept and the rest are discarded. In the full prompts used as ICL demonstrations, the paragraphs consist of the ground-truth paragraphs concatenated with M randomly sampled paragraphs. For the test instance, all paragraphs collected so far are used. If the generated CoT sentence starts with “answer is ”, or the preset maximum number of steps (8) is reached, the process terminates and all retrieved paragraphs are returned.

The CoT-guided retrieval step (“RETRIEVE”) uses the last generated CoT sentence as a query to retrieve more paragraphs and adds them to the collected set.
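The alternation of the two steps can be sketched as a short loop. This is a simplified illustration under stated assumptions, not the authors' code: `retrieve` and `next_cot_sentence` are hypothetical callables standing in for the BM25 retriever and the prompted LLM, and termination is checked with a simple substring test for "answer is".

```python
from typing import Callable, List, Tuple


def ircot_retrieve(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    next_cot_sentence: Callable[[str, List[str], List[str]], str],
    k: int = 4,
    max_steps: int = 8,
) -> Tuple[List[str], List[str]]:
    """Interleave reasoning and retrieval; return (collected paragraphs, CoT sentences)."""
    collected = list(retrieve(question, k))  # base retrieval with the question as query
    cot: List[str] = []
    for _ in range(max_steps):
        # REASON: produce the next CoT sentence from question, paragraphs and CoT so far.
        sentence = next_cot_sentence(question, collected, cot)
        cot.append(sentence)
        if "answer is" in sentence.lower():
            break
        # RETRIEVE: use the latest CoT sentence as the query and extend the collected set.
        for paragraph in retrieve(sentence, k):
            if paragraph not in collected:
                collected.append(paragraph)
    return collected, cot
```

The returned paragraphs would then be passed, together with the question, to the QA reader. Keeping the retriever and the generator as injected callables makes the loop easy to test with stubs.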

QA model

Finally, a QA reader produces the answer from the question and the collected set of retrieved paragraphs. Two well-known QA prompting strategies are used: CoT prompting (zero-shot/few-shot CoT) and GPT-3 prompting. The CoT prompt is the same as the one shown above. If the last CoT sentence has the form “answer is…”, the answer is extracted programmatically; otherwise the full generation is returned as the answer. For GPT-3 prompting, the entire CoT part of the prompt is replaced with an answer field (“A: ”).

Experimental Setup

Open-domain multi-step QA is evaluated on four datasets: HotpotQA, 2WikiMultihopQA, the answerable subset of MuSiQue, and the answerable subset of IIRC. For HotpotQA, Wikipedia is used as the knowledge source; for the others, the knowledge source originally associated with each dataset is used.

Models

The retriever is BM25 as implemented in Elasticsearch. Two retriever systems are compared: (i) One-step Retriever (OneR): retrieves K paragraphs with the question as the query, where K is chosen from {5,7,9,11,13,15}. (ii) IRCoT Retriever: the CoT generator is OpenAI GPT3 (code-davinci-002) or Flan-T5-*.

For in-context demonstrations, the authors wrote CoTs for 20 questions per dataset and built 3 sets of training demonstrations by sampling 15 of them each time. In all experiments, the best hyperparameters are selected on the dev set and then used on the test set. At test time, as many demonstrations as possible are packed into the input: GPT-3 uses its full 8K word-piece limit, and Flan-T5-* uses 6K word pieces because of GPU memory limits (80G A100). For the IRCoT retriever, K is chosen from {2,4,6,8} and M from {1,2,3}.

The retriever metric is the recall of the gold paragraphs among the 15 paragraphs that are finally retrieved. The K that maximizes recall on the dev set is selected and used on the test set.
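The metric can be written down as a few lines of Python. This is a minimal sketch in which paragraphs are identified by hypothetical string ids and the cutoff of 15 paragraphs is a parameter.

```python
from typing import List


def paragraph_recall(retrieved: List[str], gold: List[str], max_paragraphs: int = 15) -> float:
    """Fraction of gold paragraphs found among the first `max_paragraphs` retrieved ones."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0  # convention chosen for this sketch when there is no gold paragraph
    hits = gold_set & set(retrieved[:max_paragraphs])
    return len(hits) / len(gold_set)


example = paragraph_recall(["p1", "p7", "p3"], gold=["p1", "p9"])
```

Here one of the two gold paragraphs is retrieved, so `example` is 0.5.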

The QA reader is the same LM that is used in the reason step. Direct prompting was more effective for Flan-T5-*, and CoT prompting was more effective for GPT3. Accordingly, Direct prompting is used for QA with Flan-T5-*, and CoT prompting is used for QA with GPT3.

Open-domain QA (ODQA) models: the ODQA models that are finally compared are OneR QA, IRCoT QA, and NoR QA, a retriever-less QA reader that shows how well the LM does in a closed-book setting.

Results

IRCoT retrieval is better than one-step.

Figure 3 compares the retrieval recall of OneR and IRCoT for Flan-T5-XXL and GPT3. IRCoT is clearly ahead for both models.

IRCoT QA outperforms NoR and OneR QA.

Figure 4 compares the ODQA performance of NoR, OneR, and IRCoT. IRCoT improves performance in every case except GPT3 on IIRC, which is surprising given that IRCoT's retrieval recall is 21 points ahead in Figure 3. The explanation offered is that GPT3's training data already contains IIRC-relevant knowledge.

IRCoT is effective in OOD setting.

Because CoT does not always work well on a new dataset, NoR, OneR, and IRCoT are also compared in an OOD setting. In this setting, the prompt demonstrations come from one dataset and evaluation is done on the other datasets. In both the recall results of Figure 5 and the answer F1 results of Figure 6, IRCoT comes out ahead in all cases, following the same trend as before.

IRCoT generates CoT with fewer factual errors.

To assess the factuality of the generated CoTs, factual errors were checked on 40 randomly sampled questions. As Figure 7 shows, NoR makes the most factual errors, OneR makes fewer, and IRCoT makes the fewest. Qualitative results are shown below.

IRCoT is also effective for smaller models.

The results of IRCoT on smaller models are shown above. In Figure 9, IRCoT with the 3B model is even much stronger than OneR or NoR with GPT3, a model 58 times larger.

IRCoT is SOTA for few-shot multistep ODQA.

Because different methods and different APIs were used, an apples-to-apples comparison is difficult. Even so, IRCoT performs much better than previous state-of-the-art approaches such as DecomP, ReAct, and Self-Ask.

Conclusion

Chain-of-thought prompting has significantly improved LLMs’ ability to perform multi-step reasoning. We leveraged this ability to improve retrieval, and in turn, improve QA performance for complex knowledge-intensive open-domain tasks in a few-shot setting. We argued that one-step question-based retrieval is insufficient for such tasks, and introduced IRCoT, which uses interleaved CoT reasoning and retrieval steps that guide each other step-by-step. On four datasets, IRCoT significantly improves both retrieval and QA performance when compared to one-step retrieval, for both large and relatively smaller-scale LMs. Additionally, CoTs generated by IRCoT contain fewer factual errors.

[ACL2023] FutureTOD: Teaching Future Knowledge to Pre-trained Language Model for Task-Oriented Dialogue

Sat, 19 Aug 2023 04:00:00 +0000

[pdf] [github]

Weihao Zeng^*1, Keqing He^*2, Yejie Wang ¹, Chen Zeng ¹, Jingang Wang ², Yunsen Xian ², Weiran Xu^*1
¹ Beijing University of Posts and Telecommunications, Beijing China, ² Meituan, Beijing, China

Abstract

(Motivation) Pre-trained language models (PLMs) have been very successful across NLP scenarios, but they are less useful in practice for dialogue because of the intrinsic difference between general text and task-oriented dialogue.
Recent dialogue pre-training methods rely on a contrastive framework, but they have difficulty selecting true positives and hard negatives.
(FutureTOD) This paper presents FutureTOD, which distills future knowledge into the representation of the previous dialogue context using a self-training framework.
(Intuition) The intuition is that a good dialogue representation should both learn local context information and be able to predict future information.
FutureTOD performs well overall, and is especially strong in generalization and robustness.

Introduction

[ACL2022] An Interpretable Neuro-Symbolic Reasoning Framework for Task-Oriented Dialogue Generation

Tue, 28 Feb 2023 10:21:00 +0000

[pdf] [github]

Shiquan Yang¹, Rui Zhang², Sarah Erfani¹, Jey Han Lau¹
¹ The University of Melbourne,
² www.ruizhang.info

Abstract

(Motivation) Research on the interpretability of task-oriented dialogue systems is needed.
(Method) To obtain a transparent reasoning process, the authors introduce a neuro-symbolic approach that reasons through explicit reasoning chains.
(Limitation) Existing neuro-symbolic methods suffer from error propagation during multi-hop reasoning because of their one-phase design.
(Solution) To address this, they adopt a two-phase approach with a hypothesis generator and a reasoner. The hypothesis generator first produces multiple hypotheses, the reasoner then evaluates them, and one hypothesis is selected for the final prediction. The whole process is learned from raw text, without any reasoning chain annotations.
(Experiment) On two public benchmarks, the method not only achieves good performance but also yields an interpretable decision process.

Introduction

Task-oriented dialogue (TOD) systems are advancing rapidly, but because of the black-box nature of deep learning they lack explainability. Due to this implicit reasoning, when a wrong inference over the knowledge base (KB) brings back wrong information, there is no way to tell where the problem occurred or what it was. This paper proposes interpretable KB reasoning that provides useful information while also being interpretable. To this end, the authors propose the Neuro-Symbolic Dialogue framework (NS-Dial), a novel method that combines the representation capacity of neural networks with the explicit reasoning of symbolic approaches. Existing neuro-symbolic methods [1][2] follow a one-phase procedure in which a tree-structured program composed of pre-defined, human-interpretable neural modules produces the final prediction. In KB reasoning, however, the reasoning process spans multiple triplets in diverse ways, so such a one-phase design is prone to error propagation and leads to sub-optimal results.

The authors therefore use a two-phase procedure to alleviate the effect of error propagation. Multiple hypotheses are generated first, and they are then evaluated to choose the final hypothesis for the final prediction. Here a hypothesis is a triplet made of an entity mentioned in the dialogue context, an entity in the KB, and the relation between them. A ground-truth triplet is one that contains an entity mentioned in the ground-truth response. For example, in the figure above, the hypothesis generator would produce the two triplets “Cityroom, located_in, Leichhardt” and “Gonville_hotel, located_in, Leichhardt”, and the reasoner would use a proof tree to build reasoning chains and verify that “Cityroom, located_in, Leichhardt” is the valid one. The whole process is trained end to end from raw dialogues only and needs no additional intermediate labels.

Preliminary

This work focuses on dialogue response generation grounded in a KB. Given a dialogue history $X$ and a knowledge base $B$, the system response $Y$ is generated word by word.
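Word-by-word generation corresponds to the usual autoregressive factorization. The equation is not preserved in this post, so the form below is a reconstruction, with $n$ (the length of the response) introduced here as an assumption:

$$P(Y \mid X, B) = \prod_{t=1}^{n} P(y_t \mid X, B, y_1, \ldots, y_{t-1})$$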

$y_t$ is the $t$-th token of the response $Y$.

The overall model is shown in the figure above. The standard modules are described first, followed by the two novel modules.

Dialogue Encoding
First, BERT is used to obtain a distributed representation of the dialogue history. A [CLS] token is prepended to the history tokens, and for the input tokens $X = ([CLS],x_1,…,x_M)$ the hidden states $H_{enc} = (h_{CLS},h_1,…,h_M)$ are computed as follows.

Response Generation
A linear layer first projects the BERT hidden states of the dialogue history to the decoder dimension. $H\prime_{enc} = (h\prime_{CLS},h\prime_1,…,h\prime_M)$ denotes the hidden states projected to the decoder dimension. The decoder is initialized with $h\prime_{CLS}$ to obtain $h_{dec,t}$, which then attends over $H\prime_{enc}$ to give $h\prime_{dec,t}$. These are concatenated to form the context vector C, which is projected to the vocabulary space $V$.

Next, the KB distribution $P_{kb,t}$ (a probability distribution over the entities in the KB) is estimated in an interpretable way, and $P_{vocab,t}$ and $P_{kb,t}$ are fused to generate the final output token. Following See et al., a soft-switch mechanism fuses the two distributions to produce the output token $y_t$. Concretely, a generation probability $p_{gen} \in [0,1]$ is computed as follows,

and the probability distribution is then formed as $P(w) = p_{gen} P_{vocab,t}(w) + (1 - p_{gen}) P_{kb,t}(w)$, from which $y_t$ is generated by greedy decoding.
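The soft switch can be illustrated with toy distributions. This sketch assumes the usual mixture form from See et al., $P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) P_{kb}(w)$. The dictionaries, the value of `p_gen`, and the function names are made up for the example; in the model, `p_gen` is predicted by the network.

```python
from typing import Dict


def fuse_distributions(p_vocab: Dict[str, float], p_kb: Dict[str, float], p_gen: float) -> Dict[str, float]:
    """Mix the vocabulary and KB distributions with the soft switch p_gen."""
    words = set(p_vocab) | set(p_kb)
    return {
        w: p_gen * p_vocab.get(w, 0.0) + (1 - p_gen) * p_kb.get(w, 0.0)
        for w in words
    }


def greedy_pick(dist: Dict[str, float]) -> str:
    """Return the token with the highest probability."""
    return max(dist, key=dist.get)


p_vocab = {"the": 0.6, "hotel": 0.4}
p_kb = {"Cityroom": 0.9, "Gonville_hotel": 0.1}
fused = fuse_distributions(p_vocab, p_kb, p_gen=0.3)
token = greedy_pick(fused)
```

With a low `p_gen` the model leans on the KB distribution, so the KB entity with the highest belief score is emitted.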

The key question is how $P_{kb,t}$ is obtained. This is explained through the two novel modules: (1) the hypothesis generator and (2) the reasoner.

Neuro-Symbolic Reasoning for Task-Oriented Dialogue

To obtain the KB distribution $P_{kb,t}$, the model uses a hypothesis generator (HG) and a hierarchical reasoning engine (HRE). The HG module takes the context vector C from above as input and outputs a set $HYP$ of K hypotheses. Each hypothesis is fed to the HRE, which generates a logical reasoning chain and assigns a belief score. The computed belief scores serve as $P_{kb,t}$, a distribution over the entities in the KB.

Hypothesis Generator
For a hypothesis triplet “[H,R,T]”, where H is the head entity, T the tail entity, and R the relation, three types of hypothesis are considered: H-hypothesis, R-hypothesis, and T-hypothesis. In an H-hypothesis, R and T can be inferred from the dialogue context, but H is unknown and has to be looked up in the KB, so it has the form “[▷,R,T]”. R-hypotheses and T-hypotheses are defined analogously. The HG generates several hypotheses that fill in ▷ and passes them to the HRE.

Intuitively, a hypothesis is determined by its content and its structure. The structure is the template form of the hypothesis, and the content is what fills the template. For example, an H-hypothesis has the form “[▷,R,T]”, and its content consists of the candidate entities for “▷” and the query states “R” and “T”. The authors take a divide-and-conquer approach with three sub-components: structure prediction, query state prediction, and candidates prediction.

Structure Prediction (SP)
The goal of SP is to decide which type of hypothesis (H, T, or R) should be generated. The context vector C is passed through a shared transformation layer, and the task-agnostic feature $h_{share}$ is computed as follows.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen image Encoders and Large Language Models

Mon, 27 Feb 2023 06:56:00 +0000

[pdf] [github] [huggingface]

Junnan Li^‡, Dongxu Li^‡, Silvio Savarese^‡, Steven Hoi^‡
^‡ Salesforce Research

Abstract

(Motivation) As vision-and-language transformers scale up, pre-training has become prohibitively expensive.
(Method) This paper introduces BLIP-2, an efficient pre-training strategy that uses off-the-shelf frozen pre-trained image encoders and frozen LLMs.
(Method) BLIP-2 bridges the modality gap with a lightweight Querying Transformer.
(Method) It is trained in two stages: the first stage bootstraps vision-language representation learning from a frozen image encoder, and the second stage bootstraps vision-to-language generative learning from a frozen LLM.
(Experiment) BLIP-2 achieves state-of-the-art results on several vision-and-language tasks. In particular, on zero-shot VQAv2 it outperforms Flamingo80B by 8.7% with 54 times fewer parameters.

Introduction

Vision-and-language pre-training (VLP) has advanced rapidly in recent years, but pre-training requires large-scale models and datasets. It is natural for vision-and-language models to harvest performance from unimodal models, which have each advanced on their own. This paper introduces a generic and compute-efficient VLP method that bootstraps from off-the-shelf pre-trained vision models and language models. Pre-trained vision models provide high-quality visual representations. Pre-trained language models (LLMs) provide strong language generation and zero-shot transfer abilities.

To use pre-trained unimodal models for VLP, cross-modal alignment is essential. However, LLMs never see images during their pre-training, so freezing them makes vision-language alignment even harder. Existing methods such as Frozen and Flamingo rely on an image-to-text generation loss to reduce the modality gap, but this paper shows that the generation loss alone is insufficient.

To solve this problem and achieve effective vision-language alignment, the paper proposes the Querying Transformer (Q-Former). As in the figure above, the Q-Former uses a set of learnable query vectors to extract visual features from the frozen image encoder. It acts as an information bottleneck between the frozen image encoder and the frozen LLM, extracting the visual features that are most useful for generating the desired text. In the first pre-training stage, the Q-Former learns which visual representation is most relevant to the text. In the second pre-training stage, the Q-Former is connected to the LLM and vision-to-language generative learning is performed, so that the output of the Q-Former can be interpreted by the LLM.

Method

Model Architecture

To bridge the gap between the frozen image encoder and the frozen LLM, a trainable module, the Q-Former, is introduced. It extracts a fixed number of output features regardless of the input image resolution. As in the figure above, the Q-Former consists of two transformer submodules that share the same self-attention layers: (1) an image transformer that interacts with the frozen image encoder for visual feature extraction, and (2) a text transformer that can act as both a text encoder and a text decoder. The self-attention layers are initialized from pre-trained BERT, while the cross-attention layers injected into the transformer blocks are randomly initialized. The Q-Former has 188M parameters, and the query vectors also count as model parameters. The experiments use 32 query vectors (Z) of dimension 768. Compared with the 257x1024 frozen image features of ViT-L/14, Z at 32x768 is much smaller. These query vectors are used to extract the visual information that is most relevant to the text.

Inspired by BLIP, three pre-training objectives are then jointly optimized. As in the figure above, each objective uses a different attention masking strategy.

Image-Text Contrastive Learning (ITC)
ITC maximizes the mutual information between the image representation and the text representation by contrasting a positive pair against negative pairs. The text representation t, which is the output embedding of the [CLS] token, is aligned with the query outputs Z: the similarity between t and each of the 32 query outputs is computed, and the highest one is taken as the image-text similarity. To avoid information leakage, a unimodal self-attention mask is used, so that, as in the figure above, queries and text cannot attend to each other.
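The "pick the best query" step can be sketched with plain Python lists. Cosine similarity is an assumption of this sketch (the post only says "similarity"), and the toy vectors and function names are illustrative.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def image_text_similarity(query_outputs: List[List[float]], text_embedding: List[float]) -> float:
    """Similarity of an image-text pair: the maximum over all query outputs."""
    return max(cosine(q, text_embedding) for q in query_outputs)


queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-ins for the 32 query outputs
text = [0.0, 2.0]                               # stand-in for the [CLS] text embedding
score = image_text_similarity(queries, text)
```

In this toy example the second query is perfectly aligned with the text embedding, so it determines the score.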

Image-grounded Text Generation (ITG)
ITG trains the Q-Former to generate text from a given input image. Similar to UniLM, it is implemented with a causal self-attention mask.

Image-Text Matching (ITM)
ITM is used to learn fine-grained alignment between image and text representations. It is a binary classification task that predicts whether an image-text pair is matched. With a bi-directional self-attention mask, all queries and text tokens can attend to each other.
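The three masking strategies can be written out explicitly. The sketch below builds boolean masks where `mask[i][j]` is `True` if position `i` may attend to position `j`, with the queries placed before the text tokens. The ITG rule (queries see only queries, text sees all queries plus earlier text) follows my reading of the BLIP-2 paper and is labeled as an assumption; the function name is illustrative.

```python
from typing import List


def attention_mask(num_queries: int, num_text: int, objective: str) -> List[List[bool]]:
    """Self-attention mask for the Q-Former objectives: 'itc', 'itg' or 'itm'."""
    n = num_queries + num_text
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            i_is_query = i < num_queries
            j_is_query = j < num_queries
            if objective == "itm":
                allowed = True  # bi-directional: everything attends to everything
            elif objective == "itc":
                allowed = i_is_query == j_is_query  # unimodal: no query-text interaction
            elif objective == "itg":
                if i_is_query:
                    allowed = j_is_query            # queries never see text (assumption)
                else:
                    allowed = j_is_query or j <= i  # text sees queries and earlier text
            else:
                raise ValueError(f"unknown objective: {objective}")
            mask[i][j] = allowed
    return mask


itc = attention_mask(2, 2, "itc")
itg = attention_mask(2, 2, "itg")
itm = attention_mask(2, 2, "itm")
```

Because the self-attention layers are shared, switching the mask is all that is needed to move between the three objectives.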

Bootstrap Vision-to-Language Generative Learning from a Frozen LLM

In the generative pre-training stage, the Q-Former is connected to a frozen LLM in order to harvest the LLM's generative language capability. As in the figure above, a fully connected (FC) layer linearly projects the output embeddings Z to the dimension of the LLM's text embeddings. The projected query embeddings are then prepended to the input text embeddings, where they act as soft visual prompts. As shown in the figure, decoder-only models are trained with a language modeling loss, and encoder-decoder models are trained with a prefix language modeling loss.

Pre-training data
As in BLIP, 129M images are used in total, drawn from COCO, Visual Genome, CC3M, CC12M, SBU, and 115M images from LAION400M. The CapFilt method is used to create synthetic captions for the web images. Specifically, the $BLIP_{large}$ captioning model generates 10 captions per image, and these are reranked together with the original web caption by the image-text similarity computed with CLIP ViT-L/14 before being used.

Pre-trained image encoder and LLM.
Pre-trained image model: (1) ViT-L/14 from CLIP, (2) ViT-G/14 from EVA-CLIP
Pre-trained LLM: (1) OPT for decoder-only LLMs, (2) FlanT5 for encoder-decoder LLMs

Experiment

As the table above shows, BLIP-2 achieves much better zero-shot performance with far fewer parameters.

Instructed Zero-shot Image-to-Text Generation

BLIP-2 enables the LLM to understand images well; examples are shown in the figure above. For zero-shot VQA, the prompt is “Question: {} Answer:” with OPT and “Question: {} Short answer:” with FlanT5.
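Filling these templates is plain string formatting. In the small sketch below, the template strings follow the paragraph above (written with a trailing colon on both), while the dictionary keys and the function name `build_vqa_prompt` are made up for the example.

```python
# Prompt templates per LLM family; "{}" is replaced by the question text.
VQA_PROMPTS = {
    "opt": "Question: {} Answer:",
    "flant5": "Question: {} Short answer:",
}

def build_vqa_prompt(question: str, llm_family: str) -> str:
    """Return the text prompt that follows the visual prompts for zero-shot VQA."""
    return VQA_PROMPTS[llm_family].format(question)

opt_prompt = build_vqa_prompt("what is the dog doing?", "opt")
t5_prompt = build_vqa_prompt("what is the dog doing?", "flant5")
```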

As the table above shows, BLIP-2 achieves strong results on VQAv2 and GQA. In particular, it outperforms Flamingo80B by 8.7% on VQAv2 while using 54x fewer trainable parameters.

The figure above shows the effect of pre-training stage 1. For both types of LLM, results are better when the queries first learn visual information in pre-training stage 1.

Image Captioning

The prompt for image captioning is “a photo of”. The model is fine-tuned on COCO; the table above reports results on the COCO test set and zero-shot results on NoCaps. BLIP-2 performs very well on out-of-domain image captioning.

Visual Question Answering

Given annotated VQA data, VQA is learned by fine-tuning the Q-Former while the LLM is kept frozen. As the table above shows, BLIP-2 achieves state-of-the-art results among open-ended generation models.

[NAACL2022] Database Search Results Disambiguation for Task-Oriented Dialog Systems

Wed, 21 Dec 2022 00:46:00 +0000

[pdf] [papers with code]

Kun Qian^†, Satwik Kottur^‡, Ahmad Beirami^‡, Shahin Shayandeh^‡, Paul Crook^‡, Alborz Geramifard^‡, Zhou Yu^†, Chinnadhurai Sankar^‡
^† Columbia University, ^‡ Meta AI

Abstract

(Motivation) Current Task-Oriented Dialogue (TOD) systems present only a single result even when the database search returns several.
(New Task) The authors propose a new task, Database Search Result (DSR) Disambiguation, in which multiple database search results are presented to the user.
(Solution) Turns are synthetically generated with a pre-defined grammar, and human paraphrasing is used to build the dataset.
(Experiment) Models trained with the augmented data handle utterances that disambiguate multiple database search results better than models trained without it, and the authors confirm that the proposed approach enhances user experience.

Introduction

Task-Oriented Dialogue (TOD) systems are actively studied for virtual assistants such as Siri and Google Assistant. They narrow down constraints over the course of a conversation with the user and then present an entity from the database search results. However, as the figure above shows, existing TOD systems present only one entity even when the database search returns several results. The authors define this as database search result ambiguity (DSR-ambiguity).
Resolving this ambiguity takes two steps: first, asking a clarification question, and second, understanding the user's corresponding answer. The first step has been studied extensively, but there is little work on the second, understanding the answer/intent. The authors therefore augment MultiWoz and SGD to improve performance on the second step.
MultiWoz and SGD are the datasets used by many state-of-the-art TOD systems, yet database search result ambiguity occurs in 66% of their dialogues. Nevertheless, every dialogue picks just one of the search results and presents it. Not every result has to be shown, but offering two or three options greatly helps user engagement. The authors extract templates from the SIMMC 2.0 dataset, which contains disambiguation turns, and databases from MultiWoz and SGD, and use them to generate a single-turn disambiguation dialogue dataset for experiments. Then, for realistic application, they augment MultiWoz and SGD with these turns and train a model on the result.
The contributions listed by the authors are as follows.

We propose Database Search Result Disambiguation, a new dialog task focused on understanding the user’s needs through clarification questions.
We provide a generic framework for augmenting disambiguation turns, and apply this framework to augment the two most popular task-oriented dialog datasets with disambiguation cases. We also conduct human paraphrasing for the augmented utterances in test sets.
We create a benchmark for the new task with pre-trained GPT2 model. The results show that our augmented dataset enhances the model’s disambiguation ability, while maintaining the performance on the original tasks.

Task Formulation

The authors propose a new task: disambiguation of database search results in dialogue. As in the figure above, given a dialogue context $c$, a system response $s$ containing optional results, and a user utterance $u$, the target of the task is to extract the entity of the result chosen by the user.

Dataset

MultiWoz and SGD contain no cases for the disambiguation task, so both datasets are augmented in three steps.
3.1 Synthesizing Single-Turn Dialog

As in the figure above, synthetic single-turn dialogues are generated first so that the model can learn disambiguation turns. Disambiguation turns take this form from here on. The system response $s$ lists several options (in blue), and the user utterance $u$ contains the entity of the chosen result (in red). The model's goal is to extract the entity name from the user utterance $u$.
To build these synthetic turns, templates are created from the SIMMC 2.0 dataset. SIMMC 2.0 contains turns that resolve ambiguity, such as “do you mind being a bit more precise about which shoes you’re curious about, the red one or the blue one”. The authors delexicalize the domain-related tokens in such utterances (e.g. “shoes”, “the red one”, “the blue one”) to form templates. They then extract a Context-free Grammar (CFG) from the templates and use it to generate turns. A CFG rule looks like “SENT -> do you mind VERBING”. In theory the CFG can generate 2 million system utterances $s$ and more than 30 thousand user utterances $u$, which ensures diversity. Entities extracted from the various domains of MultiWoz and SGD are then inserted into the CFG output to produce synthetic turns. To keep the utterances natural, the number of options is limited to at most five.
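A toy version of this template-to-CFG generation can be sketched as follows. The grammar rules, the slot names `DOMAIN` and `OPTIONS`, and the helper names are invented for illustration; the authors' actual grammar is extracted from SIMMC 2.0 and is far larger.

```python
import random

# Toy grammar in the spirit of "SENT -> do you mind VERBING ...".
# Upper-case symbols are non-terminals or delexicalized slots (illustrative only).
GRAMMAR = {
    "SENT": [["do you mind", "VERBING", "which", "DOMAIN", "you mean,", "OPTIONS"]],
    "VERBING": [["telling me"], ["clarifying"], ["being a bit more precise about"]],
}

def expand(symbol, slots, rng):
    """Recursively expand a symbol into text."""
    if symbol in slots:                    # slot: fill with a database entity string
        return slots[symbol]
    if symbol not in GRAMMAR:              # terminal text
        return symbol
    production = rng.choice(GRAMMAR[symbol])
    return " ".join(expand(part, slots, rng) for part in production)

def system_turn(domain, options, rng, max_options=5):
    """Generate one clarification question listing at most `max_options` entities."""
    if len(options) < 2:
        raise ValueError("a disambiguation turn needs at least two options")
    options = options[:max_options]
    listed = ", ".join(options[:-1]) + " or " + options[-1]
    return expand("SENT", {"DOMAIN": domain, "OPTIONS": listed}, rng) + "?"

rng = random.Random(0)
turn = system_turn("restaurant", ["Curry Garden", "Nandos", "Pizza Hut"], rng)
```

Because every non-terminal multiplies the number of possible surface forms, even a modest grammar yields a large space of distinct utterances, which is the source of the diversity mentioned above.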
In addition, to make the user utterance $u$ harder, five methods are used: Positional Addressing, Partial Addressing, Addressing with Typo, Multiple Addressing, and Addressing with Attributes.
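Three of these addressing styles are sketched below. The post does not spell out how each one is realized, so the behaviors here (an ordinal reference, the first word of the name, and one swapped character pair) are guesses based only on the method names, and all function names are hypothetical.

```python
import random

ORDINALS = ["first", "second", "third", "fourth", "fifth"]

def positional(options, chosen):
    """Refer to the chosen entity by its position in the system's list."""
    return f"the {ORDINALS[options.index(chosen)]} one"

def partial(chosen):
    """Refer to the chosen entity by only part of its name (here: the first word)."""
    return chosen.split()[0]

def with_typo(chosen, rng):
    """Refer to the chosen entity with two adjacent characters swapped."""
    if len(chosen) < 2:
        return chosen
    i = rng.randrange(len(chosen) - 1)
    chars = list(chosen)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

options = ["Curry Garden", "Nandos", "Pizza Hut"]
examples = {
    "positional": positional(options, "Nandos"),
    "partial": partial("Curry Garden"),
    "typo": with_typo("Pizza Hut", random.Random(0)),
}
```

In each case the extraction target stays the full entity name, so the model has to map the indirect mention back to one of the listed options.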

3.2 Automatic Augmentation
Training only on the single turns from 3.1 makes it hard to apply the model to complete dialogues. The authors therefore augment MultiWoz and SGD by adding disambiguation turns.

The figure above shows that ambiguity occurs in 66.7% of turns. In SGD and MultiWoz, the system always presents a single suggestion after the database search, and the user side simply accepts it before the dialogue continues. This is how the datasets avoid ambiguity.

As in the figure above, the system utterance $s$ is generated from the CFG of 3.1 and the slot-values in the MultiWoz/SGD databases. A sentence in which the user makes a choice is then added to the user utterance $u$, and the result is concatenated into the original dialogue. Because the remaining turns are left unchanged, the effect on the rest of the dialogue is kept small.
Such ambiguity does not occur in every domain. The authors therefore augment only the restaurant, hotel, and attraction domains in MultiWoz and 24 of the 45 services in SGD. About 30% of dialogues are involved, and about 2% of turns are modified.

3.3 Human Paraphrasing
User utterances generated by the CFG can sound unnatural, so the authors have humans paraphrase them, as in the figure above.

The interface used for human paraphrasing is shown in the figure above.

Experiment

Dataset : MultiWoz / SGD
Evaluation : (1) Accuracy on whether the model can successfully predict the correct name entity, (2) Joint Goal Accuracy (JGA) as DST
Model : GPT-2
Experiment : Train on the original/augmented data, then test on the original/augmented/human-paraphrased test sets.
Because augmented turns make up only 2% of the data, the same number of single turns, 5 thousand for SGD and 3 thousand for MultiWoz, is generated and added for training. In the result tables, “Syn 100%” denotes a model additionally trained on as many single turns as there are training turns.
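The two evaluation metrics can be computed as below. This is a minimal sketch assuming the usual definitions: name entity accuracy is the fraction of turns where the predicted name matches the gold name, and JGA counts a turn as correct only if the whole predicted dialogue state equals the gold state. The case-insensitive comparison and the slot names are choices made for the example.

```python
def name_entity_accuracy(predicted, gold):
    """Fraction of turns whose predicted entity name equals the gold name."""
    # Assumption: comparison ignores case and surrounding whitespace.
    correct = sum(p.strip().lower() == g.strip().lower() for p, g in zip(predicted, gold))
    return correct / len(gold)

def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns where every slot-value pair matches the gold state."""
    correct = sum(p == g for p, g in zip(predicted_states, gold_states))
    return correct / len(gold_states)

pred_names = ["Nandos", "curry garden", "Pizza Hut"]
gold_names = ["Nandos", "Curry Garden", "Pizza Express"]

pred_states = [
    {"restaurant-name": "nandos", "restaurant-area": "centre"},
    {"restaurant-name": "nandos"},                               # one slot missing
]
gold_states = [
    {"restaurant-name": "nandos", "restaurant-area": "centre"},
    {"restaurant-name": "nandos", "restaurant-area": "south"},
]

acc = name_entity_accuracy(pred_names, gold_names)      # 2 of 3 names match
jga = joint_goal_accuracy(pred_states, gold_states)     # 1 of 2 states match exactly
```

JGA is the stricter of the two: a single wrong or missing slot makes the whole turn count as incorrect.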

Results and Analysis

5.1 Augmentation Helps Resolve Ambiguity

The first result is the name entity prediction accuracy measured on the augmented turns, which make up 2% of the whole test set. Comparing the “Original” and “AutoAug” test-set columns, a model trained on the original training set degrades from 0.556 to 0.242 (SGD) and from 0.676 to 0.488 (MultiWoz). This supports the assumption that the original MultiWoz/SGD datasets contain almost no disambiguation turns, so the model fails to understand the user's answer to a clarification question. In the “AutoAug” rows, however, accuracy improves from 0.242 to 0.496 (SGD) and from 0.488 to 0.744 (MultiWoz). This shows that the model learns the disambiguation skill through augmentation. The same holds on the human-paraphrased dataset.

The results above are for the whole test set rather than only the 2% of augmented turns. Because most turns are unchanged, the gains are not as dramatic as before, but training on the augmented data still clearly gives better performance.
In addition to name entity prediction, Joint Goal Accuracy (JGA), the metric used in DST, is measured.
Table 6 reports JGA on the augmented turns and Table 3 on the whole test set. In both, “Aug + Syn100%” performs best. The authors conclude that the augmentation method helps not only with disambiguation but also with DST.

5.2 Augmentation Brings No Harm
The authors' ultimate goal is to “expand end-to-end TOD with the disambiguation skill”. To confirm that the DSR disambiguation augmentation does no harm to solving the original dialogues, the experiments above also evaluate on the original test set. Unsurprisingly, models trained on the original training set, or with only about 5% additional Syn data, perform well because they share the distribution of the original test set. Models trained on the augmented data perform just as well on the original test set, and in some cases even better. This verifies that training on augmented data does not harm performance on the original test set.

5.3 Leveraging Augmented Turns
To use their datasets effectively, the authors propose a setting in which SGD and MultiWoz are learned together. Since the two share similar domains, they may help each other. The model is therefore first fine-tuned on SGD and then on MultiWoz (SGD_ori + Origin). Fine-tuning on the augmented SGD training data first is also tried (SGD_aug + Origin). Because only 2% of all turns are augmented there, a further setting upsamples these turns and first trains on the fully augmented SGD (Upsample). “Upsample + Syn” gives the best name entity accuracy, while “Aug + Syn” gives the best JGA. The authors attribute this to training on too many disambiguation turns affecting the original task. The best option is therefore a model trained on the target dataset augmented with the proposed method plus the synthesized single-turn data (“Aug + Syn”).
“SGD_ori + Origin + Syn” also performs well even though MultiWoz is not augmented at all. For datasets other than SGD and MultiWoz, the authors therefore recommend first training on their augmented MultiWoz and SGD data, then fine-tuning on the original data, and also training on the synthesized single-turn dataset.

[ICML2022] Data Determines Distributional Robustness in Contrastive Language-Image Pre-training (CLIP)

Mon, 14 Nov 2022 07:30:00 +0000

[pdf]

Alex Fang¹, Gabriel Ilharco¹, Mitchell Wortsman¹, Yuhao Wan¹, Vaishaal Shankar², Achal Dave², Ludwig Schmidt^{1 3}
¹ University of Washington, ² Amazon, ³ Allen Institute for Artificial Intelligence

Abstract

(Motivation) Contrastively trained vision-language models such as CLIP, ALIGN, and BASIC show remarkable robustness to distribution shift. What causes such large robustness gains is an important question.
(Solution) The question is explored through a systematic experimental investigation.
(Method) Five possible causes are examined experimentally: (1) training set size, (2) training distribution, (3) language supervision at training time, (4) language supervision at test time, and (5) the contrastive loss function.
(Result) The more diverse the training distribution (2), the larger the robustness gain; the other four factors contributed little to no robustness.
(New Dataset) The authors release ImageNet-Captions, a new version of ImageNet with the original text annotations from Flickr, which enables controlled vision-and-language training.

Introduction

Large pre-trained vision-and-language models such as CLIP, ALIGN, and BASIC show unprecedented robustness to a variety of natural distribution shifts. In contrast to earlier models trained on images with class annotations, CLIP and its relatives learn directly from images and the corresponding unstructured text found on the web. These models achieve large robustness gains on challenging distribution shifts such as ImageNetV2 and ObjectNet. Until now, despite many advances in machine learning, no algorithmic technique had delivered this level of robustness improvement on these datasets. This raises an important question: “What causes CLIP’s unprecedented robustness?”

Since language-image (vision-and-language) models, rather than vision-only techniques, were the first to achieve large robustness gains, one might expect multimodal learning over language and images to be the key to robustness. Pinpointing the cause of CLIP's robustness is difficult, however, because CLIP departs from the standard training paradigm of image classification models in quite a few ways. For example, the highest-accuracy CLIP models use the Vision Transformer (ViT) architecture. Radford et al. already investigated model architecture and size in the CLIP paper and found that these factors do not contribute much to robustness. Even so, several plausible factors remain as possible causes of CLIP's robustness.

The large training set size (400 million images)
The training distribution
Language supervision at training time
Language supervision at test time via prompts
The contrastive loss function

Understanding CLIP's robustness matters because it can guide future work toward reliable machine learning.

This paper identifies the cause of CLIP's robustness through controlled experiments on the five plausible causes listed above. The main result is that CLIP's robustness is determined by the training distribution. Language supervision at training time does not make the model more robust than standard supervised learning, so language supervision has only an indirect effect on robustness. Specifically, language supervision removes the need for consistent annotation of class labels, which makes it easy to train on a diverse distribution of images. To restate the conclusion: the more diverse training distribution, not the language supervision, is what leads to more robust representations.

The investigation of CLIP's robustness proceeds in two main directions. The first is the introduction of a new dataset, ImageNet-Captions. ImageNet-Captions is paired language-image data created by augmenting 463,622 images out of the 1.2 million in the ImageNet 2012 training set with their original text data, which is extracted from the corresponding Flickr images. Because the images are the same, ImageNet-Captions enables controlled experiments comparing standard ImageNet training with language-image training.

The second is a new baseline for language-image training that performs similarly to CLIP training while minimizing the interaction between the vision and language components. Specifically, the authors introduce the following training procedure and illustrate its behavior on the YFCC-15M dataset.

(1) Use SimCLR to pre-train on the images only of YFCC-15M.

(2) Fine-tune the resulting representation from (1) by matching ImageNet classes to YFCC-15M samples with simple text matching.

Since this approach does not rely on a language model, it reaches performance similar to CLIP training with much simpler language processing. Beyond providing a baseline for understanding CLIP training, the authors state that this simple approach opens the way to algorithmic improvements in language-image training.

Background

Pinpointing the cause of CLIP's robustness requires a precise experimental setup for comparing the robustness of different models. This section first reviews the effective robustness framework introduced by Taori et al. as background, and then examines the robustness gains of CLIP models.

Experimental setup for measuring robustness
Building a reliable machine learning model means designing a model that works well across a diverse range of test distributions. For example, a model with 75% accuracy on ImageNet should reach about 75% on the closely related ImageNetV2 dataset, as humans do ([1]). Most models do not show such consistent performance, however, and drop by about 12 percentage points under this distribution shift ([2]). In contrast, the CLIP model of Radford et al. drops by only 6 percentage points and is therefore more robust. CLIP also shows much smaller accuracy drops on many other distribution shifts, not only this one. (Here “CLIP” refers not just to the CLIP model of Radford et al. but to the family of contrastively trained vision-language models, including ALIGN and BASIC.)

Formally, for a model $f$ and two test distributions $D_1$ and $D_2$, $acc_{D_1}(f)$ and $acc_{D_2}(f)$ are measured and compared. Typically $D_1$ is the ImageNet (ILSVRC-2012) test set and $D_2$ is one of several out-of-distribution test sets. An ideal model would reach 100% accuracy on both distributions, but since no such model exists, models are compared on robustness, i.e. how small the gap between the two accuracies is. One confounder is that higher accuracy on $D_1$ usually already comes with higher accuracy on $D_2$ ([3],[4]). In Figure 1 above, the blue points are models trained on ImageNet; the x-axis is $acc_{D_1}(f)$ and the y-axis is $acc_{D_2}(f)$, averaged over 4 out-of-distribution shifts. The scatter of blue ImageNet-trained models (and of all other models as well) shows that simply raising ImageNet accuracy also raises accuracy under the other distribution shifts (the trend slopes upward).

To handle this confounder when measuring robustness, Taori et al. define robustness as accuracy beyond the baseline given by ImageNet models, and call this quantity effective robustness. In Figure 1 it is the vertical distance from the blue line. Radford et al. showed that CLIP models achieve high effective robustness, as the purple line in Figure 1 indicates. Formally, let the baseline function $\beta : \mathbb{R} \to \mathbb{R}$ map $acc_{D_1}(f)$ to $acc_{D_2}(f)$. For a new model $f'$, effective robustness is $\rho(f') = acc_{D_2}(f') - \beta(acc_{D_1}(f'))$. This is the main quantity the paper visualizes to understand the robustness of CLIP models.
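A small numerical sketch of effective robustness follows. It assumes the baseline is a straight line fitted directly to (in-distribution, out-of-distribution) accuracy pairs; the original framework may fit the baseline in a transformed accuracy space, which is not described here. All accuracy values are made up.

```python
import numpy as np

def fit_linear_baseline(acc_d1, acc_d2):
    """Fit beta: acc_D1 -> expected acc_D2 from standard ImageNet models."""
    slope, intercept = np.polyfit(acc_d1, acc_d2, deg=1)
    return lambda a: slope * a + intercept

def effective_robustness(beta, acc_d1_new, acc_d2_new):
    """rho(f') = acc_D2(f') - beta(acc_D1(f'))."""
    return acc_d2_new - beta(acc_d1_new)

# Made-up baseline models lying on the line acc_D2 = acc_D1 - 0.15.
imagenet_acc = np.array([0.60, 0.70, 0.80])
shifted_acc = np.array([0.45, 0.55, 0.65])
beta = fit_linear_baseline(imagenet_acc, shifted_acc)

# A hypothetical new model: 70% on D1 and 65% on D2.
rho = effective_robustness(beta, 0.70, 0.65)   # 0.65 - 0.55, about 0.10
```

A model sitting exactly on the baseline has zero effective robustness no matter how accurate it is, which is what separates robustness from plain accuracy gains.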

Following Taori et al. and Radford et al., the experiments focus on natural distribution shifts. Natural variation includes changes in lighting, geographic location, and so on, as opposed to synthetic distribution shifts, which are computational modifications of images such as added Gaussian noise, blur, or perturbations. Because natural distribution shifts are representative of real data, the following natural distribution shift datasets are chosen.

(1) ImageNet-V2 (Recht et al., 2019) : a reproduction of the ImageNet validation set with distribution shift due to changes in the crowdsourcing process.

(2) ImageNet-Sketch (Wang et al., 2019) : black and white sketches of ImageNet images.

(3) ImageNet-R (Hendrycks et al., 2021) : renditions (e.g., art, patterns, etc.) of 200 ImageNet classes.

(4) ObjectNet (Barbu et al., 2019) : real-world objects from ImageNet with crowd-sourced random backgrounds, rotations, and viewpoints

(5) ImageNet-A (Hendrycks et al., 2019) : naturally occurring examples filtered so they are misclassified by a ResNet-50 model.

An important property of effective robustness under these distribution shifts is that changing the size of the training set does not affect it. Taori et al. and Miller et al. have already shown that sub-sampling the training set affects accuracy but has no effect on effective robustness. This rules out training set size as the cause of CLIP's high effective robustness.

Additional related work
Vision-language models such as VirTex, ICMLM, and ConVIRT have been actively studied before, but CLIP and ALIGN train on far larger corpora, perform well on many downstream tasks, and show unprecedented robustness.

Other work has analyzed the generalization of CLIP. Devillers et al. studied the good generalization that multimodal models such as CLIP show in few-shot and linear-probe evaluation, running experiments that use only one of the two modalities, image or text. Their analysis found no clear advantage of multimodal models over using a single modality. In contrast, this paper studies how language affects out-of-distribution generalization, in terms of CLIP's robustness. Unlike Devillers et al., it controls for in-distribution accuracy in order to separate accuracy from robustness.

Andreassen et al. observe that CLIP's zero-shot capability and effective robustness decrease as fine-tuning proceeds. Many similar papers followed Radford et al.'s CLIP, such as ALIGN, BASIC, and LiT; of these, LiT is closest to this work. LiT takes a pre-trained image model and fine-tunes only the text head to obtain good performance on downstream tasks. The main difference is that LiT fine-tunes on 4 billion image-caption pairs to obtain zero-shot performance, whereas this work converts captions into class labels via substring matching and then trains a regular image classifier.

ImageNet-Captions

The authors built a new dataset, ImageNet-Captions, for experiments on image-text supervision. It was designed to meet the following four requirements.

To isolate the effect of natural language supervision on effective robustness, the dataset needs traditional classification labels alongside natural language supervision. With both kinds of label, experiments can be designed so that models differ solely in the loss function, with no other structural difference.
It needs text annotations that come from the original image source rather than synthetically generated captions (this removes model bias).
It should be tied to a commonly used benchmark such as ImageNet.
It should be large enough for current research.

No dataset satisfied all of these points before this work. ImageNet-Captions is a subset of the ImageNet (ILSVRC 2012) training set with the paired original image title, description, and tags obtained from Flickr. (Most of ImageNet was sourced from Flickr.)

Constructing ImageNet-Captions
The goal of ImageNet-Captions is to augment ImageNet images with their original text data. This is not easy, because the ImageNet 2012 dataset comes with no metadata. The authors build the dataset from the following three facts:

Most of ImageNet was sourced from Flickr.
The ImageNet fall 2011 release includes image URLs.
Given a photo identifier, the Flickr API can provide the associated text data.

Using the image URLs, the authors select the images in the ImageNet fall 2011 release that come from Flickr and restrict them to the 1,000 class labels, leaving 640 thousand images. They then remove images not in ILSVRC 2012 using the deduplication method of Jain et al., and remove images containing profanity, which leaves 463,622 images. The result is a subset of ILSVRC-2012 that comes with original text data; in particular, this text data includes the title, description, and class label.

Properties of ImageNet-Captions

More than 90% of ImageNet-Captions is in English, but it also contains 127 other languages. As the table above shows, the class label appears in the corresponding text in 94% of cases. The captions therefore carry class-relevant information and are well suited to training image-text models.

Imagenet-Captions experiments

The ImageNet-Captions dataset is used for the effective robustness experiments. A ResNet50 CLIP model is trained on ImageNet-Captions with a contrastive loss, and a classifier is trained on the equivalent image classification dataset by adding a linear layer on top of the CLIP model's vision encoder.

Caption construction

For ImageNet-Captions, one has to decide which metadata (title/description/tags) to use as the caption, so several variants were tested. Radford et al. used a filter to keep English-only text, and a similar filter was applied as one variant. The filter did not improve results enough to compensate for the loss of image-text pairs; dataset size turned out to matter most for performance.

Robustness
To assess the robustness of models trained on ImageNet-Captions, they are compared on ImageNet and on the natural distribution shift datasets.

As the figure shows, ImageNet-Captions CLIP and ImageNet-Captions classification follow nearly the same linear trend. This shows that CLIP models are not more robust than classification models trained on the same dataset, despite the difference of language supervision. The comparison on ImageNet-Captions is a better one than the comparison with ImageNet classification models, because it no longer has a different image distribution as a confounding factor (i.e. green vs. orange is a much better comparison than blue vs. purple). Even so, these models do not show the robustness of the Radford et al. CLIP models.

Pre-training on language
The analysis above shows that language supervision on ImageNet-Captions does little for model robustness. This alone cannot rule out a contribution of language supervision to the robustness of the OpenAI CLIP model, however, so the authors run an additional experiment. They take the language encoder of the pre-trained OpenAI CLIP model together with a randomly initialized vision encoder and train on ImageNet-Captions, with variants that freeze or do not freeze the language weights.

As the graph above shows, both the frozen and the unfrozen language head improve accuracy over random initialization (green points), but neither variant adds effective robustness. Natural language supervision therefore cannot be said to contribute to robustness.

Effect of using templates

In experiments with a prompt template like the one used by the OpenAI CLIP model (“A photo of a {label}”), performance improves but robustness does not. Templates are therefore not a cause of robustness either.

Improving ImageNet performance using captions

YFCC experiments

The experiments so far show empirically that language supervision alone does not improve robustness. To understand CLIP's robustness more deeply, the authors test whether representation learning with minimal or no language supervision can yield the same robustness. The result would show that CLIP's robustness comes from a diverse data distribution rather than from language supervision.

The experiments use the Yahoo Flickr Creative Commons (YFCC) dataset. CLIP trained on YFCC also shows improved robustness. To test whether the YFCC images alone improve robustness, a “standard” image representation is contrastively pre-trained without the language part of YFCC. A zero-shot classifier is then fine-tuned on top of this image-only representation using minimal text processing (substring matching). The resulting classifier shows effective robustness similar to CLIP. This demonstrates that the training distribution, not language supervision at training time, is the main reason behind CLIP’s robustness.

Dataset
YFCC-15M, a subset of YFCC-100M, is used. It is filtered to keep only English titles and descriptions and contains 14,829,396 images with natural language captions. To train an image classifier on YFCC-15M, the authors convert it into a classification dataset, YFCC-15M-Cls, by assigning ImageNet class labels with a simple rule: if a class label or one of its synonyms appears in the title/description, it becomes the label, and images with no such match are discarded. This yields 1,694,125 images (11.4% of the full dataset) covering 953 classes. The largest class has 280 thousand images and the smallest has 1.
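The labeling rule can be sketched as a plain substring test. The class names, synonym lists, and sample captions below are made up, and returning the first matching class when several match is an assumption; the post does not say how such ties are handled.

```python
def assign_label(title, description, class_synonyms):
    """Return the first class whose name or synonym occurs in the text, else None."""
    text = f"{title} {description}".lower()
    for label, synonyms in class_synonyms.items():
        # Plain substring test: note it also fires inside longer words.
        if any(s.lower() in text for s in synonyms):
            return label
    return None

# Hypothetical class -> synonyms table (the real one covers ImageNet classes).
CLASS_SYNONYMS = {
    "golden retriever": ["golden retriever"],
    "tabby": ["tabby", "tabby cat"],
}

samples = [
    ("My golden retriever at the beach", ""),
    ("sunset", "no animals here"),
    ("Tabby", "sleeping on the sofa"),
]

labeled = [(t, d, assign_label(t, d, CLASS_SYNONYMS)) for t, d in samples]
kept = [row for row in labeled if row[2] is not None]   # unmatched images are dropped
```

No language model is involved at any point: the text is used once to derive a class label, after which training proceeds as ordinary image classification.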

Classification training.
The classification model is a ViT-B/16 fine-tuned on YFCC-15M-Cls with a softmax cross-entropy loss, initialized from the SimCLR model pre-trained on YFCC-15M.

Result.

Results are shown in the table and figure above. The SimCLR + 11% fine-tuning classification model comes very close to the result of CLIP training, and on Avg OOD its robustness is nearly the same as CLIP's. The interpretation is as follows.

Effect of test time prompts

Effect of contrastive training losses

[ICML2022] NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework

Sun, 13 Nov 2022 05:00:00 +0000

[pdf] [github]

Xingcheng Yao^{* 1}, Yanan Zheng^{* 2}, Xiaocong Yang^{3 4}, Zhilin Yang^{1 5 4}
^* Equal Contribution, ¹ Institute for Interdisciplinary Information Sciences, Tsinghua University, ² Department of Computer Science and Technology, Tsinghua University, ³ School of Economics and Management, Tsinghua University, ⁴ Recurrent AI, Inc, ⁵ Shanghai Qi Zhi Institute

Abstract

(Motivation) Pre-trained Language Models (PLMs) have become a powerful standard for solving NLP tasks, but they are computationally very expensive to train.
(Solution) This work proposes TLM, a simple and efficient learning framework that does not rely on large-scale pretraining.
(Method) Given labeled task data and a large general corpus, TLM uses the task data as queries to retrieve a tiny subset of the general corpus, then jointly optimizes the task objective and the language modeling objective from scratch.
(Result) On 8 datasets across 4 domains, TLM matches or exceeds PLMs while using two orders of magnitude fewer FLOPs.

Introduction

Pre-trained Language Models (PLMs) have been highly successful in NLP. PLMs are pre-trained on large general corpora with self-supervised language modeling tasks such as masked language modeling (MLM; BERT, RoBERTa, T5), autoregressive language modeling (GPT-2, GPT-3), and permutation language modeling (XLNet), and then fine-tuned on a small amount of downstream task data. They achieve dominant performance on many NLP tasks.

However, these PLMs are computationally expensive. RoBERTa-Large, for example, requires $4.36 \times 10^{21}$ FLOPs, roughly a day of computation on 1,000 32GB V100 GPUs. Larger language models are costlier still: GPT-3 needs about 50 times more training compute than RoBERTa-Large. Compute at this scale makes research on new architectures, customized LMs, or improved pre-training losses infeasible on a limited budget, especially for academic groups. Most NLP researchers currently rely on improving fine-tuning algorithms, but the resulting performance is largely upper-bounded by the pre-training procedure.
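The figures in this paragraph can be sanity-checked with a few lines of arithmetic. The sketch only rearranges the numbers quoted above (the RoBERTa-Large FLOPs, 1,000 GPUs for one day, the factor of 50); the "two orders of magnitude" line applies the reduction claimed for TLM to the same budget, and the variable names are arbitrary.

```python
ROBERTA_LARGE_FLOPS = 4.36e21          # training compute quoted above
NUM_GPUS = 1000
SECONDS_PER_DAY = 24 * 60 * 60

# Sustained throughput each GPU would need to finish in one day.
per_gpu_flops_per_sec = ROBERTA_LARGE_FLOPS / (NUM_GPUS * SECONDS_PER_DAY)

# "50 times more" compute for GPT-3, taken at face value.
gpt3_flops = 50 * ROBERTA_LARGE_FLOPS

# Two orders of magnitude fewer FLOPs than RoBERTa-Large.
two_orders_less = ROBERTA_LARGE_FLOPS / 100

print(f"{per_gpu_flops_per_sec:.3e} FLOP/s per GPU")   # 5.046e+13
print(f"{gpt3_flops:.2e} FLOPs for GPT-3 scale")        # 2.18e+23
print(f"{two_orders_less:.2e} FLOPs at 1/100 scale")    # 4.36e+19
```

Read this way, the "1,000 GPUs for a day" figure implies roughly 50 TFLOP/s sustained per GPU.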

Several earlier works (ELECTRA, Primer, [1], EarlyBERT) tried to make language model pre-training more efficient, but most stop at proposing sample-efficient self-supervised tasks or efficient Transformer architectures suited to pre-training. These are effective and useful, but they reduce FLOPs by only about one order of magnitude. Distillation has also been used to shrink PLMs (DistilBERT, TinyBERT), but it has the drawback of requiring a large PLM to be trained first. Distilled PLMs also still lag well behind original PLMs such as RoBERTa-Large.

This work proposes an entirely new alternative to the pretraining-finetuning framework that brings a drastic efficiency improvement without a performance drop. The authors call it Task-driven Language Modeling (TLM): a simple, efficient, pre-training-free framework. Given a large general corpus and some labeled task data, TLM trains a model from scratch without relying on PLMs. TLM is motivated by two key ideas. First, when cramming for an exam, people review just a few chapters rather than every book; the authors hypothesize that reading the entire large corpus is highly redundant for solving a specific task. Second, training directly on supervised labeled data is more effective for downstream performance than optimizing a language modeling objective on unlabeled data. Based on these, TLM uses the task data as queries to retrieve a tiny subset of the general corpus, and then jointly optimizes a supervised task objective and a language modeling objective on the retrieved data and the task data.

On 8 datasets from 4 domains (news, reviews, computer science, and biomedical science; experimental setup: [2]), TLM matches or exceeds BERT and RoBERTa while using two orders of magnitude fewer FLOPs.

Pre-trained Language Models
Many PLMs have appeared since BERT and have become the de-facto solution for many NLP problems. Almost all of them learn contextualized token representations from a large corpus during pre-training and are then fine-tuned on labeled data for a specific task. BERT is trained with MLM on 16GB of English corpora; RoBERTa has the same architecture as BERT but is trained on 160GB of English text with a large batch size, dynamic token masking, and other changes. This work uses BERT and RoBERTa as baselines.

Efficient Pretraining for NLP
Much work has aimed to improve the efficiency of language model pre-training. You et al. and Megatron-LM use data and model parallelism to accelerate the pre-training process, but acceleration through parallelism does not reduce FLOPs at all. EarlyBERT and Primer search for efficient neural networks using the lottery ticket hypothesis and Neural Architecture Search, cutting computational cost by 50% to 70% in FLOPs. ELECTRA and DeBERTa design new LM pre-training mechanisms, adversarial training and disentangled representations of content and position, and reduce computation cost by 50% to 75%. Train-no-evil obtains a 50% computational cost reduction with task-guided pre-training based on selective masking. This work is orthogonal to all of these: it improves efficiency by reducing training data redundancy, and it yields a far more drastic improvement.

Efficient Inference of Pretrained Models
Another line of PLM research improves inference efficiency. DistilBERT, TinyBERT, MobileBERT, FastBERT, BORT, and BERT-of-Theseus pursue inference efficiency with small-sized models. Q8-BERT, Q-BERT, and I-BERT use quantization to speed up inference with low-precision representations. Other work ([3], [4], [5]) uses pruning to obtain small PLMs for inference. However, these model compression methods not only depend on a large PLM but also lose a fair amount of performance. The method proposed here does not depend on a PLM, and its performance is similar or better.

Domain and Task Adaptation for Pretrained Models
Domain-adaptive fine-tuning fine-tunes a pre-trained model on in-domain data with a language modeling objective, and has been shown to work well for domain/task adaptation ([6], [7], [8], [9]). TLM differs in that it needs no additional domain data and uses only the corpora of BERT and RoBERTa. In addition, domain-adaptive fine-tuning requires a pre-trained model, whereas TLM does not.

Co-training for Semi-supervised Learning and Data-Density-Based Active Learning
Two lines of work resemble TLM: Co-Training (CT) ([10], [11]) and Data-Density-Based Active Learning (DAL) ([12],[13]). CT and TLM both use unlabeled data to help learn a certain task, but they differ in two respects. First, CT requires several distinct models to look at the unlabeled data from different views, whereas TLM trains a single model. Second, TLM has a selection process for the unlabeled data, which CT does not consider.

TLM and DAL share the flavor of finding representative instances in unlabeled data. However, DAL assumes that every unlabeled instance can be labeled under the task definition, which TLM does not require. Moreover, DAL iteratively searches the whole unlabeled set for critical instances, whereas TLM finds the instances relevant to the labeled data in a single one-shot pass, which makes TLM much more efficient than classic DAL algorithms.

Method

TLM : Task-Driven Language Modeling

Humans can master a specific task quickly with limited time and effort. When cramming for an exam, for example, a person studies only a few chapters rather than every book in the world and can still do well. From this observation, the authors hypothesize that the key is to locate task-relevant information quickly and accurately. Accordingly, TLM (1) automatically retrieves relevant training data from a general corpus and (2) trains on the retrieved data combined with the task data.

Formally, given a general corpus $D = \lbrace d_i \rbrace_i$, where $d_i$ is a document, and labeled task data $T = \lbrace (x_i, y_i) \rbrace_i$, where $x_i$ is text and $y_i \in Y$ is a label, the goal is to train a model $f$ that estimates the conditional probability for classification, $f(x)=\hat{p}(y \vert x)$.

As in the figure above, TLM consists of two steps: (1) retrieving data from the general corpus using the task data as queries, and (2) jointly optimizing a language modeling objective and the task objective on the retrieved data and the task data.

Retrieval From General Corpus
For each example in the task data, the top-K documents are retrieved, and the results are combined into a subset $S$. $S$ is a tiny subset of the general corpus $D$.

The authors use BM25 for efficient retrieval. An embedding-based dense retriever ([13]) could give better retrieval results, but it is not used, both to keep the method as simple as possible and because it incurs additional computational cost. The trade-off between retrieval performance and computational cost is left to future work. For extremely long texts, the authors also find that retrieving with keywords as the query, extracted by an algorithm such as RAKE, works better than using the entire input sequence as the query. From here on, the retrieved data $S$ is called external data and the task data $T$ internal data.

[Note] This method is task-agnostic: it depends only on the input text $x$ and not on the label $y$. The retrieval procedure also does not assume access to domain-specific data.

Joint Training

$L_{mlm}(x)$ is a masked language modeling loss as in BERT, and $L_{task}(f(x),y)$ is the task-specific loss function. $\rho_1$ and $\rho_2$ are hyperparameters. The network architecture is the same as BERT, with the CLS head used for classification and the LM head used for MLM. TLM can also be extended to architectures other than BERT.

Training has two stages. In the first stage, one batch of internal data (the second line of the loss) is interleaved after every $\rho_1$ batches of external data (the first line of the loss). In the second stage, $\rho_1$ and $\rho_2$ are both set to 0, and the model is fine-tuned on the internal data alone with the task objective.
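As a rough sketch of the two-stage schedule, assume the loss weights MLM on external data by $\rho_1$ and adds $\rho_2$ times MLM on internal data to the task loss. This form is an assumption, chosen because setting both weights to 0 then leaves only the task objective, as the text describes. The function names are hypothetical and the losses are plain numbers standing in for real model outputs.

```python
def joint_loss(mlm_external, mlm_internal, task_internal, rho1, rho2):
    """Assumed form of the joint objective; each argument is a scalar loss value."""
    return rho1 * mlm_external + rho2 * mlm_internal + task_internal


def stage_one_schedule(num_internal_batches, rho1):
    """Interleave one internal batch after every rho1 external batches."""
    schedule = []
    for _ in range(num_internal_batches):
        schedule.extend(["external"] * rho1)
        schedule.append("internal")
    return schedule


# Stage 1: mostly external batches, with internal batches mixed in.
stage_one = stage_one_schedule(num_internal_batches=2, rho1=3)

# Stage 2: rho1 = rho2 = 0, so only the task loss on internal data remains.
stage_two_loss = joint_loss(mlm_external=2.0, mlm_internal=1.0,
                            task_internal=0.5, rho1=0, rho2=0)
```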

Comparison between TLM and PLMs

Both TLM and the pretraining-finetuning paradigm of PLMs have two stages. In fact, the second stage of TLM is identical to the fine-tuning stage of a PLM. The differences between the two frameworks are shown in the table below.

A PLM learns as much task-agnostic knowledge as possible at extremely high cost, whereas TLM learns only task-related data at very low cost. The following subsections look at the pros and cons of TLM.

Democratizing NLP
In the existing pretraining-finetuning paradigm, fine-tuning performance is largely upper-bounded by the pretrained model. Yet most NLP researchers, limited by compute, could not even consider training a large-scale LM and had to settle for tweaking fine-tuning algorithms. Research on PLM design choices or pretraining losses was a privilege reserved for a small number of researchers, which can slow down the study and development of PLMs. In this sense, TLM democratizes NLP and can accelerate LM research on architectures, loss functions, algorithms, and more by opening it up to many researchers.

Efficiency
In per-task FLOPs, TLM is overwhelmingly cheaper than PLMs. In most cases there are only a few target tasks, so TLM is preferable in terms of cost. For example, TLM is a reasonable choice when solving four NLI tasks or building a single recommender system. If there are 1,000 tasks to solve (for instance, a company building an NLP platform), a PLM is probably still the more effective option.

Flexibility
Because TLM is task-driven, it is highly flexible. Researchers can use custom strategies for tokenization, sequence length, data representation, hyperparameter tuning, and so on.

Generality
TLM trades generality for efficiency. A PLM can learn task-agnostic general representations, whereas TLM learns only a single task-specific representation. Increasing the generality of TLM is left as future work, and the authors expect multi-task learning to be a way forward.

Experiments

The experimental setting follows Gururangan et al.

Datasets
Experiments cover 8 datasets across 4 domains. The high-resource datasets, with more than 5K task examples, are AGNews, IMDB, RCT, and Helpfulness. The low-resource datasets are ChemProt, ACL-ARC, SciERC, and HyperPartisan. The training corpora of BERT and of RoBERTa are used as the general corpora.

Baselines
The baselines are BERT and RoBERTa, each at base and large scale. TLM comes in three versions, small, medium, and large, defined by the total number of training tokens (the product of training steps, batch size, and sequence length). Each version is compared against the BERT and RoBERTa baselines of the corresponding scale.

Main Results

TLM reaches similar or better performance while drastically reducing training data size and computation cost. At small scale in particular, TLM matches BERT-Large using only 1/33 of the FLOPs and 1/16 of the training corpus. At medium and large scale, TLM is better by 0.59 and 0.24 points respectively, while using two orders of magnitude less FLOPs and training data. Overall, TLM is highly accurate and much more efficient than PLMs. This is especially pronounced at large scale, and the authors suggest that large-scale PLMs learn so much general knowledge that much of it is not useful for a specific task.

Ablation Study

The table above compares retrieval methods (BM25 versus random retrieval) and general corpus sizes. On the same general corpus, BM25 gives the best results. In particular, BM25 beats random retrieval by about 1 point on IMDB and by 3 to 4 points on the other two, low-resource datasets. The authors take this as support for their intuition that low-resource tasks rely more on external data.

To compare general corpus sizes, the BERT corpora and the RoBERTa corpora are used. On all three datasets, performance improves with the larger general corpus (the RoBERTa corpora). This gain (about 1 point for a 10x larger corpus) is similar to findings for PLMs. The result suggests that higher performance can be obtained from larger general corpora while keeping efficiency. With random retrieval, in contrast, no such effect appears, so it is not sensitive to corpus size.

The experiment on K in top-K retrieval is shown in the table above. On the high-resource AGNews, the value of K matters little, but on low-resource datasets, larger K gives better results. This supports empirically the authors' intuition that low-resource tasks depend more on external data through joint training.

LANGUAGE MODELING WEIGHTS $\rho_1$ AND $\rho_2$

First, for $\rho_1$: high-resource tasks such as Helpfulness do better with a smaller $\rho_1$, while low-resource tasks such as SciERC and ChemProt do better with a higher $\rho_1$. This is in line with the earlier finding that low-resource tasks depend more on external data. Using only external data without any task data performs poorly, which confirms that even a small amount of task data is indispensable.

The table below shows that language modeling is essential. The best performance is obtained when $\rho_2$ is nonzero (specifically 20 or 100).

Second stage of training

As the table above shows, two-stage training performs better. Removing the second stage consistently hurts the final performance, which shows that the second stage is indispensable. Its effect is especially large on low-resource tasks.

MLM loss on task data

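In the first stage, TLM also applies the masked language modeling loss to the task data. To check whether this matters, the authors add MLM on the task data to a PLM; as the table above shows, it makes little difference. The authors argue that the benefit comes less from MLM on the task data than from MLM on the relevant retrieved data, which works better than the PLM's MLM on general corpora.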

Analysis

Attention weight visualization

The authors compare the behavior of TLM and PLM (pretraining-finetuning framework) models through attention weight visualization. Voita et al. showed that specific kinds of heads have an outsized effect on the final prediction, such as “positional heads”, in which at least 90% of the maximum attention weights are assigned to adjacent tokens. Another notable kind is the head whose maximum attention weight falls on [CLS], [SEP], or the period token (‘.’); such heads are likely to encode less semantic or syntactic information and are called “vertical heads”. In the figure above, TLM has more positional heads and fewer vertical heads. The same pattern is observed on other tasks, so TLM learns attention patterns different from those of PLMs, which the authors argue are more informative.

Case study on retrieved data

The case study is shown in the table above. Because BM25 is based on sparse features, it focuses on lexical rather than semantic similarity. This is especially beneficial in professional domains where specific nouns appear frequently (e.g., SciERC for computer science, ChemProt for biomedical science). BM25 also does well in domains outside these professional ones.

Results on More Datasets

So far the experiments followed the setting of Gururangan et al.; for additional results, the authors also run experiments on the GLUE benchmark used in BERT. In the small-scale setting, TLM performs similarly to BERT-base on almost all benchmarks while drastically reducing cost.

[ICML2022] Describing Differences between Text Distributions with Natural Language

Sat, 12 Nov 2022 08:38:00 +0000

[pdf] [github]

Ruiqi Zhong¹, Charlie Snell¹, Dan Klein¹, Jacob Steinhardt¹
¹ Computer Science Division, University of California, Berkeley. Correspondence to: Ruiqi Zhong

Abstract

(Motivation) How can we tell how two distributions of text differ? For a human, this means reading many samples directly, which takes a great deal of time.
(Solution) This paper proposes a method that uses GPT-3 to automatically describe the differences between text distributions. It is a new framework that can be applied to a variety of other tasks.
(Method) GPT-3 is fine-tuned with the prompt “[samples of $D_0$] + [samples of $D_1$] + the difference between them is __.”, and the generated description sentences are re-ranked by how well they match each dataset.
(Result) With the original GPT-3 Curie (13B) model, the descriptions agree with human annotation only 7% of the time; after fine-tuning this rises to 61%, and with GPT-3 Davinci (175B) it reaches 76%, showing empirically that the descriptions generated by the proposed method capture the text distributions well.

Introduction

“What inputs trigger a neuron in my deep learning model? How are the train and test distributions different for my application? How did public opinions on Twitter change from last year to this year?” Answering questions like these requires a human to discover new patterns by reading many samples directly, which is intractable. This paper proposes a method that discovers the difference between two distributions and describes that difference in a natural language sentence.

The proposed approach, learning a natural language hypothesis, takes two text distributions $D_0$ and $D_1$ and finds a natural language hypothesis $s$ that is more true of $D_1$ than of $D_0$. As in the figure above, given $D_0$ and $D_1$, the sentence “is military-related” is a hypothesis that describes $D_1$ better than $D_0$. Additionally, by taking $D_0$ to be the train set and $D_1$ the test set, one can produce a natural language sentence describing the train-test distribution difference, for example “is longer in sentence length”. Finally, the approach also applies to shifts in public opinion, for example “is optimistic about the pandemic.”

The method prompts GPT-3 Davinci to generate hypotheses $s$. Because GPT-3 has a limited context size, the prompt can hold only a few samples and not the whole distribution. The authors therefore add a verifier that re-ranks the candidates by checking how well each one holds on a larger set of samples. This is illustrated in the figure above.

Since GPT-3 is not optimized for proposing hypotheses, it can be improved through fine-tuning. No corpus exists for this task, however, so the authors use GPT-3 to collect data for fine-tuning. As in the figure above, for a hypothesis $s$, GPT-3 generates samples, humans then annotate them, and the result is used to fine-tune the proposer.

The authors validate the approach on 54 real-world binary classification datasets, whose positive classes are annotated with natural language descriptions. To fit them to this problem, the positive/negative class inputs are treated as $D_1$/$D_0$, and the top-5 descriptions are compared against the human annotation. With GPT-3 Curie (13B), the agreement is 7%; after fine-tuning it improves greatly to 61%, and with GPT-3 Davinci it reaches 76%.

Next, the authors test whether the descriptions from their system agree with existing classification datasets. For the SUBJ dataset used in subjectivity analysis, the system recognized that the dataset was constructed by contrasting movie reviews with plot summaries, yet the authors point out that many studies use it as a zero/few-shot dataset without being aware of this. The system also exposes shortcomings of several datasets. For example, in MNLI “negation” is spuriously associated with the contradiction class, and in the SMS Spam classification dataset the messages labeled as spam always contain hyperlinks. The system can also be used for text clustering.

Learning a Natural Language Hypothesis

Let $X$ be the set of all text inputs. A natural language hypothesis $h$ is parameterized by a string $s$ and maps two inputs to a boolean as follows:

$h_s : X \times X \rightarrow \lbrace 0, 1 \rbrace$

where $h_s(x_1,x_0) = 1$ means $x_1$ is more $s$ than $x_0$.
For example, when $s$ is “is longer in sentence length”, $h_s(x_1,x_0) = 1$ means that $x_1$ is longer than $x_0$. In short, the semantics of $h_s$ can be written as

$h_s(x_1,x_0) = \mathbb{1}[x_1 \text{ is more } s \text{ than } x_0]$

Let $D_0$ and $D_1$ be two distributions over $X$, and let $H$ be the space of $h$. The goal of the task is to find an $h$ in $H$ with a high “classification accuracy” (CA):

$CA(h) = \mathbb{E}_{x_0 \sim D_0, x_1 \sim D_1}[h(x_1,x_0)]$

Looking at the formula, this resembles conventional statistical machine learning, in which $h$ classifies which of the two distributions $D_0$ and $D_1$ a sample was drawn from. Unlike traditional statistical machine learning, however, the problem has two difficulties. The first is search: searching for a hypothesis in a discrete string space is hard. The second is verification: computing $h_s(x_1,x_0)$ requires human annotation, which is very expensive. This work approximates the human response with a neural network.
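To make CA concrete, here is a small Monte Carlo sketch. It assumes CA is the probability that $h(x_1,x_0)=1$ when $x_1$ is drawn from $D_1$ and $x_0$ from $D_0$, and it uses a toy, programmatically checkable hypothesis (“is longer in sentence length”) in place of human judgment. The function names and sample sentences are made up for illustration.

```python
import random


def is_longer(x1, x0):
    """Toy h_s for s = 'is longer in sentence length' (word count)."""
    return 1 if len(x1.split()) > len(x0.split()) else 0


def estimate_ca(h, d1_samples, d0_samples, num_pairs=1000, seed=0):
    """Estimate CA(h) by averaging h over randomly drawn (x1, x0) pairs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_pairs):
        x1 = rng.choice(d1_samples)
        x0 = rng.choice(d0_samples)
        hits += h(x1, x0)
    return hits / num_pairs


d1 = ["the quick brown fox jumps", "a rather long sentence about nothing at all"]
d0 = ["short one", "tiny"]
ca_separable = estimate_ca(is_longer, d1, d0)   # every x1 is longer than every x0
ca_identical = estimate_ca(is_longer, d0, d0)   # same distribution on both sides
```

When the two sides come from the same distribution, the estimate cannot be much above 0.5, which matches the intuition that no hypothesis should separate identical distributions.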

Method

The paper (1) prompts GPT-3 to propose hypotheses from a small set of samples, (2) verifies the hypotheses on a larger set with UnifiedQA, (3) builds a data collection pipeline, and (4) uses the collected data to fine-tune the proposer and the verifier. These steps are summarized in the figure above and are covered one at a time below.

(1) Hypothesis Proposer

The authors use GPT-3 to generate hypotheses. As in the figure, a few samples are drawn from $D_1$ and a few from $D_0$, and they are fed in with the prompt “Compared to group 0, each sentence from group 1 ___”. Because GPT-3 has a context size limit of 2048, 5 samples are drawn from each group. Without controlled decoding, a completion can look like “is more positive, while sentences from group 0 are ungrammatical.” Such a completion is undesirable for two reasons: the verifier would have to check two things at once (positive, ungrammatical), and the second hypothesis describes a group, whereas the verifier can only evaluate individual samples. The authors therefore prevent GPT-3 from decoding the token “group” and forbid it from generating tokens such as “,” and “or”.

Also, when $D_0$ and $D_1$ are identical or very similar, the optimal hypothesis $h^*$ should not be able to separate them well. When GPT-3 is prompted with only a handful of samples, however, this property cannot be guaranteed, which can confuse the proposer. To prevent this, the authors train a RoBERTa-Large model to predict whether each sample comes from $D_0$ or $D_1$, and form top-$p$ groups ranked by its confidence score. In the experiments, samples are drawn 10 times from each of the top-5, top-20, and top-100 groups, and 2 completions are generated for each draw, giving 3 x 10 x 2 = 60 hypotheses in total, which are then re-ranked.
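The sampling scheme behind the 60 hypotheses can be sketched as follows. This is only an illustration: the prompt layout (`Group 0:` / `Group 1:` prefixes) is a guess, the inputs are assumed to be already sorted by the classifier's confidence, and the text does not say whether 5/20/100 are counts or percentiles, so the sketch treats them as counts. No language model is called; the code only builds the prompts that would be sent.

```python
import random

PROMPT_SUFFIX = "Compared to group 0, each sentence from group 1 ___"


def build_prompt(group1, group0):
    """Lay out samples from both groups, followed by the completion stem."""
    lines = [f"Group 0: {s}" for s in group0]
    lines += [f"Group 1: {s}" for s in group1]
    lines.append(PROMPT_SUFFIX)
    return "\n".join(lines)


def sample_prompts(ranked_d1, ranked_d0, top_sizes=(5, 20, 100),
                   draws=10, per_prompt=5, seed=0):
    """Draw `draws` prompts from each top-p group of confidence-ranked samples."""
    rng = random.Random(seed)
    prompts = []
    for top in top_sizes:
        pool1, pool0 = ranked_d1[:top], ranked_d0[:top]
        for _ in range(draws):
            g1 = rng.sample(pool1, per_prompt)
            g0 = rng.sample(pool0, per_prompt)
            prompts.append(build_prompt(g1, g0))
    return prompts


ranked_d1 = [f"positive sample {i}" for i in range(100)]
ranked_d0 = [f"negative sample {i}" for i in range(100)]
prompts = sample_prompts(ranked_d1, ranked_d0)
completions_per_prompt = 2
num_hypotheses = len(prompts) * completions_per_prompt
```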

(2) Hypothesis Verifier
The CA formula above needs to be evaluated, but computing $h_s(x_1,x_0)$ requires expensive human annotation, so it is approximated with a neural network.

For a neural network $V$, $V(s,x_1,x_0)=1$ means that $x_1$ is more $s$ than $x_0$.

The authors use UnifiedQA, a question answering model based on T5, as the verifier.

As in the figure above, the context $c$ is a pair consisting of sentence A from $D_1$ and sentence B from $D_0$. The question $q$ is “is it true that sentence A is more positive?”, where “is more positive” is the hypothesis $s$. When this input is given to the QA model UnifiedQA, an answer of “yes” corresponds to $V(s,x_1,x_0)=1$ and “no” to 0. The hypotheses are then re-ranked by the CA estimated from the values of $V(s,x_1,x_0)$. Rather than evaluating every pair, $V(s,x_1,x_0)$ is computed on only 400 random $(x_1,x_0)$ samples, and the final 5 hypotheses $s$ are kept.
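The re-ranking step can be sketched with the verifier treated as a black-box function. In the sketch below, `toy_verifier` is a hypothetical rule-based stand-in for UnifiedQA that only understands two hard-coded hypotheses; the pair count of 400 and the cutoff of 5 follow the text, while everything else is an assumption.

```python
import random


def rerank(hypotheses, verifier, d1, d0, num_pairs=400, keep=5, seed=0):
    """Score each hypothesis by its estimated CA on random pairs; keep the best."""
    rng = random.Random(seed)
    pairs = [(rng.choice(d1), rng.choice(d0)) for _ in range(num_pairs)]
    scored = []
    for s in hypotheses:
        ca = sum(verifier(s, x1, x0) for x1, x0 in pairs) / num_pairs
        scored.append((ca, s))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:keep]


def toy_verifier(s, x1, x0):
    """Hypothetical stand-in for V(s, x1, x0); a real verifier is a QA model."""
    if s == "is longer":
        return int(len(x1) > len(x0))
    if s == "contains a question mark":
        return int("?" in x1 and "?" not in x0)
    return 0


d1 = ["is this a long question?", "what about this one here?"]
d0 = ["short", "tiny"]
candidates = ["is longer", "contains a question mark", "mentions sports"]
top = rerank(candidates, toy_verifier, d1, d0, keep=2)
```

The same fixed set of pairs is reused for every hypothesis so that the scores are comparable.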

(3) Collecting Data for Supervision

Neither the GPT-3 used as the proposer in (1) nor the UnifiedQA used as the verifier in (2) was trained for this task, so neither is optimal, and fine-tuning can improve them. Since no corpus exists for this task, the authors collect a new dataset.

Fine-tuning the proposer requires 5 samples that are more $s$ and 5 samples that are less $s$, and fine-tuning the verifier requires triplets $(s,x_1,x_0)$ in which $x_1$ is more $s$ than $x_0$. To obtain these, for a given hypothesis $s$ the authors have GPT-3 generate samples that satisfy $s$ and samples that do not.

Curating Hypothesis

First, the authors get help from GPT-3 to produce many hypotheses. Specifically, they write 10 hypotheses by hand and then prompt GPT-3 to brainstorm more. The resulting hypotheses range from shallow ones (e.g., “contains the word ‘yay’ at the end of the sentence”), to topical ones (“loves school”), to complex ones involving social and linguistic cues (“supports universal healthcare”, “is written in first person”).

Conditional Generation

Suppose the hypothesis $s$ is “love school”; a positive sample would be “My advisor is really helpful and I learned a lot”. Fine-tuning requires both positive and negative samples. Positive samples are generated with GPT-3, as in the figure above. Because the model sometimes produces a sentence that overlaps with $s$, such as “I love school” for $s$ = “love school”, the tokens appearing in $s$ are blocked from being generated.

Negative samples are taken from the positive samples of other hypotheses. For a highly specific hypothesis such as “talks about microwaves”, the positive samples of any other hypothesis can serve as negatives. For a hypothesis such as “uses past tense”, however, the authors wrote a contrast hypothesis by hand, “uses future tense”. This expanded the hypothesis pool from 300 to 352, and 15 positive samples are generated from these hypotheses and reused as negative samples.

Verifying with Human Annotations.

Although instruct GPT-3 performs very well, the authors verify the samples with human annotators (Turkers) for reliability. After a majority vote, 302 hypotheses remain, each with 5 corresponding positive/negative samples.

Fine-tuning
To fine-tune the proposer, GPT-3 is given 5 positive and 5 negative samples for each of the 302 hypotheses and trained to generate the hypothesis. Training runs for 2 epochs with a batch size of 20 and a learning rate of 0.05.

To fine-tune the verifier, UnifiedQA is trained so that $V(s,x_1,x_0)=1$ and $V(s,x_0,x_1)=0$. For each $s$, 30 $(x_1,x_0)$ pairs are created, and training uses 250 steps, a batch size of 32, and a learning rate of 5e-5. For out-of-distribution robustness, the weights of the original UnifiedQA and the fine-tuned UnifiedQA are averaged ([1]).
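Weight averaging itself is a simple element-wise operation. The sketch below uses plain dictionaries of float lists as a stand-in for real model checkpoints; the equal mixing weight `alpha=0.5` is an assumption, since the text only says the two sets of weights are averaged.

```python
def average_weights(original, fine_tuned, alpha=0.5):
    """Interpolate two checkpoints parameter by parameter.

    alpha = 0 returns the original weights, alpha = 1 the fine-tuned ones.
    """
    if original.keys() != fine_tuned.keys():
        raise ValueError("checkpoints must have the same parameter names")
    return {
        name: [(1 - alpha) * w0 + alpha * w1
               for w0, w1 in zip(original[name], fine_tuned[name])]
        for name in original
    }


original = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
fine_tuned = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
averaged = average_weights(original, fine_tuned)
```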

Benchmarking Performance

Dataset
The evaluation set is borrowed from the authors' previous paper: a collection of 54 binary text classification tasks in which the positive class has one or more natural language descriptions. It includes topic classification, grammaticality classification, stance classification, and more. For each task, the proposed system is asked to describe how the positive class samples differ from the negative class samples, and its top-5 descriptions are compared against the human annotation. The human-annotated description $s^*$ is assumed to be “correct”.

Evaluated Systems.
The authors conjecture that a larger proposer, a fine-tuned proposer, and a verifier for re-ranking each improve description generation. They therefore evaluate five systems: (1) the best system, which uses fine-tuned GPT-3 Davinci (175B) as the proposer; (2) a smaller proposer (fine-tuned Curie, 13B); (3) no fine-tuning (zero-shot Curie, 13B); (4) no fine-tuning (zero-shot Curie, 13B) and no verifier for re-ranking; and (5) a “memorization proposer”, where the proposer only generates the hypotheses the authors curated. If the conjecture holds, they expect (1)>(2)>(3)>(4) and (2)>(5).

Automatic Evaluation.
BERTScore is used as the automatic metric. It is computed between the human annotation and each of the top-5 descriptions; for each task the highest-scoring pair among the top-5 is selected, and the scores are averaged over the 54 tasks. The results are (1): 0.930, (2): 0.927, (3): 0.907, (4): 0.899, (5): 0.916, matching the authors' expectation. However, because all of these scores are high, a manual evaluation is also carried out.

Manual Evaluation.

When human raters evaluate the descriptions as described above, the results for each system are as follows.

Under the (A)+(B) criterion, system (4), GPT-3 Curie with no proposer fine-tuning and no re-ranking, agrees with the human annotation on only 7% of tasks (4/54). System (2), with the GPT-3 Curie proposer fine-tuned, reaches 61% agreement (33/54), and system (1), with the GPT-3 Davinci proposer fine-tuned, reaches 76% (41/54).
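As a quick sanity check, the reported percentages follow from the task counts out of 54. The helper name below is made up for the example.

```python
def agreement_percent(matches, total=54):
    """Share of tasks whose top-5 descriptions match the human annotation."""
    return round(100 * matches / total)


rates = {
    "(4) zero-shot Curie, no verifier": agreement_percent(4),
    "(2) fine-tuned Curie": agreement_percent(33),
    "(1) fine-tuned Davinci": agreement_percent(41),
}
```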

Comparing Verifiers.

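The authors report that they could not verify experimentally whether the verifier is actually effective. They do note, however, that the verifier helps remove hypotheses that are proposed repeatedly.

To compare verifiers using the CA formula above, the authors conjecture that a larger and fine-tuned verifier is better.

The results are shown above. Because the CA formula is still an approximation, fully automatic evaluation is infeasible, but UnifiedQA does serve as a verifier, and the fine-tuned verifier works better. In addition, current state-of-the-art models are 25x larger than UnifiedQA, so if the trend in the graph holds, they could perform much better.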

Application

The system can be used for summarizing training tasks, debugging dataset shortcuts, describing distribution shifts, and labeling text clusters.

Summarizing Training Tasks
The SUBJ dataset is a binary classification dataset for distinguishing subjective from objective text, and many studies use it as a zero/few-shot classification benchmark. When the system is asked to describe SUBJ, however, it produces “is a plot summary of a film” for the objective class and “is a quote from a film review” for the subjective class. The authors then read the SUBJ paper and found the passage below.

This confirms that the system's description is accurate. The authors also caution that several later datasets ([2], [3]) draw subjective vs. objective examples in the same way as SUBJ.

Debugging Dataset Shortcuts
For MNLI, a dataset widely used in Natural Language Inference (NLI), the system's description of the contradiction class is based on whether the text contains “negation”. In other words, the contradiction class in MNLI contains many negation words such as “not” and “never”, and the authors point out that models solving the task may have learned to pick up on this cue.

As another example, in the widely used spam classification dataset of Gomez et al., the samples labeled as spam all contain hyperlinks. This came to light because the system generated the description “has a higher number of hyperlinks”.

Describing Distribution Shifts.
The system can also describe the distribution difference between training and test sets. As another example, TwitterPPDB and QQP are both paraphrase detection datasets, but the former is built from tweets and the latter from Quora questions. The system describes the former as “talks about a news story more” and the latter as “contains a question.”, and this difference in descriptions reveals the distribution shift.

Labelling Text Clusters.
The system can also be used to label clusters of unlabeled text.

To do this, wikitext-2 is first embedded with RoBERTa Base, and the clustering method of Aharoni & Goldberg is used to create 64 clusters. Ten of them are selected for evaluation; one of the authors reads 20 samples from each cluster and writes a natural language description $s^*$ of the cluster. Then, from the model's top-5 descriptions, the single description $\hat{s}$ judged best is chosen. The results are shown in the graph above. Averaged over all 10 clusters, the system reaches CA=0.8, while the expert reaches 0.77. The authors conclude that the system is about as good as or better than the expert.

[ICML2022] What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

Fri, 11 Nov 2022 06:48:00 +0000

[pdf] [github]

Thomas Wang^{* 1}, Adam Roberts^{* 2}, Daniel Hesslow³, Teven Le Scao¹, Hyung Won Chung², Iz Beltagy⁴, Julien Launay^{3 5}, Colin Raffel¹
^* Equal Contribution, ¹ Hugging face, ² Google, ³ LightOn, ⁴ Allen institute for AI, ⁵ LPENS, Ecole Normale Superieure.

Abstract

(Motivation) Large Language Models (LLMs) perform well at zero-shot generalization. However, many state-of-the-art models are trained with different architectures and pre-training objectives, and there has been little systematic comparison of these factors.
(Solution) This paper presents a large-scale evaluation of the modeling choices of LLMs and their effect on zero-shot generalization.
(Experiment) The experiments cover combinations of three architectures (causal decoder-only / non-causal decoder-only / encoder-decoder), two pre-training objectives (autoregressive, masked language modeling), and with/without multitask finetuning.
(Results) Causal decoder-only + autoregressive training gives the best zero-shot performance right after pre-training, but non-causal + MLM + multitask finetuning gives the best performance in the experiments.

Introduction

The colors of the module blocks in the figure above correspond to the colors used below.

Large Language Models (LLMs) are pre-trained on unstructured text data and then perform well on a wide range of tasks without additional training or labeled data. This ability is called zero-shot generalization. Most current LLMs are built on the Transformer architecture. The original Transformer has an encoder-decoder structure, but many recent LLMs are causal decoder-only models trained autoregressively ([1], [2], [3]). The T5 model, however, showed that an encoder-decoder (ED) model outperforms decoder-only LLMs in transfer learning. In addition, non-causal decoder-only architectures such as UniLM use a modified attention mask to narrow the gap between decoder-only and encoder-decoder models.

Recent work ([4],[5],[6]) has obtained large gains for both encoder-decoder and causal decoder-only models through a multitask finetuning stage on an ensemble of prompted tasks. This raises the question of how zero-shot generalization depends on the choice of architecture in conjunction with multitask finetuning.

Transformer models can be trained with various self-supervised objectives. Typically, causal decoder-only LLMs are trained with a full language modeling (FLM) objective and encoder-decoder models with a masked language modeling (MLM) objective, which includes variants such as span corruption. The benefit of MLM for subsequent downstream task finetuning has already been verified in many studies, and T0, one of the strongest recent models, also uses MLM. Recently, Lester et al. introduced an adaptation stage (extending pretraining but with a different objective). It enables an MLM model to perform prompted text generation tasks and narrows the gap between objectives.

These results leave open the question of which pair of architecture and pre-training objective gives an LLM the strongest zero-shot generalization capability. Earlier studies did examine such combinations of architecture and objective, but mostly for transfer learning ([7], [8]) rather than zero-shot performance. Moreover, now that multitask finetuning has been shown to be effective, it is natural to ask which combination works best with it.

Large-scale systematic study

This paper studies how combinations of architecture and pre-training objective affect zero-shot generalization. As in the figure, experiments are run on six combinations of architecture (causal decoder-only, non-causal decoder-only, encoder-decoder) and objective (full, prefix, masked language modeling), each evaluated with and without multitask finetuning. The experiments are large-scale: models with 5 billion parameters (11 billion for encoder-decoder) are trained on 168 billion tokens, and multitask finetuning is run on 13 billion tokens. The evaluation sets are T0-Eval and the EleutherAI LM Harness (EAI-Eval), which together contain 30 downstream tasks with a variety of prompts.

Multitask finetuning impacts architecture and objective choice

The authors find that a causal decoder model trained with the FLM objective (similar to GPT-3) performs best when zero-shot performance is measured right after pre-training. After multitask finetuning, however, models trained with MLM perform better, and the causal decoder-only model trained with FLM falls behind.

Bridging across architectures and objectives with adaptation

Two kinds of adaptation across these combinations are considered. The first, full language modeling adaptation, converts an MLM-trained non-causal decoder model into an FLM + causal decoder; this converges 1.6x faster on the FLM task. The second, non-causal MLM adaptation, converts an FLM + causal decoder into an MLM + non-causal decoder; this converges 3.3x faster on the MLM task. These adaptations produce a new version of the model suited for multitask finetuning, which achieves the second-best performance on the benchmark.

Background

Transformer
Almost all LLMs are built on the Transformer. Building an LLM involves a great many techniques, so setting those aside and considering only the main architecture, the Transformer block is the core component. A Transformer block consists of multi-head attention, layer normalization, a dense two-layer feedforward network, and residual connections.

Encoder-Decoder
The original Transformer has an encoder-decoder structure. The encoder processes the input tokens with bidirectional conditioning, so every input token can attend to every other, and the decoder predicts the target sequence autoregressively, token by token. The self-attention layers of the decoder use a causal masking pattern (Figure 2, right) to prevent attending to future tokens. Pre-trained language models (PLMs) with an encoder-decoder structure include BART and T5.

Causal decoder-only
Recent LLMs are all Transformer variants, and many of them use a decoder-only structure. A decoder-only model takes a single text stream as input and predicts the next token autoregressively from the past tokens. This gives a weaker representation of the conditioning text but makes the model a natural fit for autoregressive prediction such as generation. The GPT series (GPT-1, 2, 3) belongs to this family.

Non-causal decoder-only
A simple modification of the attention mask was proposed to let a decoder-only model build richer representations of the input/conditioning text. It is implemented by changing the self-attention masking pattern to the one in the middle of Figure 2. This architecture is also called a prefix language model ([10]).
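The difference between the causal and non-causal (prefix) masking patterns is easy to see in code. This sketch builds both masks as nested lists, where entry `[i][j] == 1` means position `i` may attend to position `j`; the function names and the row/column convention are choices made for the example.

```python
def causal_mask(n):
    """Each position attends only to itself and earlier positions."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def prefix_mask(n, prefix_len):
    """Like the causal mask, but the first prefix_len positions are visible
    to every position, so the prefix is processed bidirectionally."""
    return [[1 if (j <= i or j < prefix_len) else 0 for j in range(n)]
            for i in range(n)]


causal = causal_mask(3)
prefix = prefix_mask(4, prefix_len=2)
```

With `prefix_len=0` the prefix mask reduces to the causal mask, which is why switching between the two architectures only requires changing the mask, not the parameters.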

Encoder-only
Some models, such as BERT, use only Transformer encoder blocks. They handle NLU tasks well but are very weak at NLG tasks.

Comparisons across architectures
A decoder-only model processes the entire sequence in the decoder, whereas an encoder-decoder model processes the input in the encoder and the target in the decoder. Consequently, for the same amount of computation, an encoder-decoder model has about twice the memory (parameters) of a decoder-only model.

Pre-training objectives
Figure 3 illustrates the pre-training objectives.

Full Language modeling
Since GPT-2, large-scale decoder-only models have shown strong results in autoregressive NLG. FLM is the objective of predicting the next token from the preceding tokens.

Prefix Language modeling
For non-causal decoder-only and encoder-decoder models to perform language modeling (LM), a prefix can be designated. As in FLM, the model predicts the next token from the preceding tokens, but the prefix is fixed and can be attended to bidirectionally. In the rest of this summary, PLM stands for prefix language modeling.

Masked Language modeling
Some of the input tokens are replaced with a special [Mask] token, and the model predicts them. Techniques such as span corruption, which replaces a run of consecutive tokens with a single mask, are also used.
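Span corruption can be illustrated with a small function that takes explicit span positions. This is a sketch in the style of T5, not the paper's code: the sentinel naming `<extra_id_N>` and the closing sentinel on the target side are conventions assumed here, and in real pre-training the spans are sampled at random instead of being passed in.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, length) span with one sentinel in the input and
    emit the removed tokens, prefixed by the same sentinel, in the target.

    spans must be sorted by start and must not overlap.
    """
    inputs, targets = [], []
    pos = 0
    for span_id, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{span_id}>"
        inputs.extend(tokens[pos:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])
        pos = start + length
    inputs.extend(tokens[pos:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, spans=[(2, 2), (8, 1)])
```

The input keeps the uncorrupted tokens with one sentinel per span, and only the target tokens contribute to the loss.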

Model adaptation
Adaptation extends an existing pre-trained model to a different objective or a different architecture. Unlike fine-tuning, it uses no downstream data at all, only additional pre-training data. Language modeling adaptation (LM-A) typically extends a model trained with MLM to PLM or FLM. It is applied so that an encoder-decoder model trained with MLM (and therefore weak at NLG) can be used for NLG. It has been used prior to prompt tuning, and in T0 it is applied to the model before multitask finetuning.

Multitask fine-tuning
Pre-training usually uses large corpora gathered by web crawling. Subsequently fine-tuning on curated, high-quality, cross-domain data improves zero-shot generalization. Studies have shown that multitask fine-tuning improves zero-shot performance both for T0, an MLM + encoder-decoder model, and for FLM + causal decoder-only models. These works fine-tune on tasks formatted with prompts. The paper uses the datasets and prompts from T0 training for its multitask fine-tuning.

Zero-shot evaluation
Radford et al. were the first to show that LLMs have strong zero-shot performance. Zero-shot evaluation relies on prompting, which casts a task into a natural language format; the template used is called the prompt. Unfortunately, performance is sensitive to the choice of prompt. Zero-shot capability has drawn growing attention because it requires no labeled examples and removes the complexity of fine-tuning for unseen tasks.

Methods

Every <architecture, objective> pair is trained on 168B tokens of C4. Zero-shot performance is then measured with and without multitask finetuning. The authors also examine whether adaptation can efficiently transfer the benefits of one architecture/objective to another.

Compute budget guideline
All models are designed to have a similar training budget: roughly 15 petaflops-days each, which amounts to 830,000 TPUv4-hours. Memory is not taken into account.

Architecture

As noted above, the architectures are configured to have the same computational cost. The details are given in the table above.

Pre-training

MLM uses the span corruption objective from T5. To match the compute budget, the authors fix the number of tokens used in pre-training rather than the number of tokens that contribute to the loss. For example, in full language modeling every token contributes to the loss, while in prefix language modeling the prefix does not. On average, PLM computes the loss on half as many tokens as FLM. For MLM, as in T5, 15% of the input tokens are corrupted with span masks of length 3, and on average about 18% of the tokens contribute to the loss.
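One plausible way to arrive at the roughly 18% figure is to count sentinel tokens on both the input and target side. The accounting below is an assumption made for illustration, not something stated in the text: each span adds one sentinel to the input and one to the target, and the closing sentinel is ignored.

```python
def mlm_target_fraction(corruption_rate=0.15, mean_span_length=3):
    """Fraction of all processed tokens (input + target) that are targets,
    per original token, under span corruption with sentinels."""
    spans_per_token = corruption_rate / mean_span_length
    input_len = (1 - corruption_rate) + spans_per_token   # kept tokens + sentinels
    target_len = corruption_rate + spans_per_token        # masked tokens + sentinels
    return target_len / (input_len + target_len)


fraction = mlm_target_fraction()
```

Under these assumptions the targets make up about 0.20 / 1.10 of the processed tokens, close to the 18% reported above, compared with 100% for full language modeling.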

Multitask finetuning
After pre-training, the models are fine-tuned on the T0 training dataset for 13B tokens. Dropout was found to have a large effect on zero-shot performance, so it is added.

Evaluation
T0-Eval provides multiple prompts per task, while EAI-Eval provides only one prompt per task. For T0-Eval, the median over prompts is taken for each task and the average over the 11 tasks is reported. Model checkpoints are saved at 42B, 84B, and 168B tokens.
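The T0-Eval aggregation (median over prompts, then mean over tasks) can be written directly with the standard library. The task names and accuracy values below are invented for the example.

```python
from statistics import mean, median


def aggregate_t0_eval(scores_by_task):
    """Median across the prompts of each task, then mean across tasks."""
    per_task = {task: median(scores) for task, scores in scores_by_task.items()}
    return mean(per_task.values()), per_task


# Hypothetical accuracies: one list of per-prompt scores for each task.
scores = {
    "task_a": [0.50, 0.60, 0.70],
    "task_b": [0.20, 0.40],
}
overall, per_task = aggregate_t0_eval(scores)
```

Using the median within a task keeps a single unusually good or bad prompt from dominating that task's score.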

Experiments

After self-supervised pretraining only
First, zero-shot performance is measured right after self-supervised pretraining. MLM-trained models are not included because they are not suited to this setting.

Causal decoder-only + FLM performs best, followed by non-causal decoder-only + PLM, while encoder-decoder + PLM performs poorly.
On T0-Eval the results differ little from the random baseline, but on EAI-Eval there are clear differences.

After multitask finetuning

Several studies have already shown that decoder-only + FLM gives better zero-shot performance and that encoder-decoder + MLM performs better after fine-tuning. The authors therefore apply multitask fine-tuning to every architecture/objective combination and evaluate again. The results are shown above.

On EAI-Eval, encoder-decoder + MLM gives the best results, closely followed by the non-causal decoder with MLM.
On T0-Eval the difference is clear: encoder-decoder + MLM is far better than the other models.
Encoder-decoder + PLM performs the worst.

Influence of the tasks and prompts used for zero-shot evaluation

EAI-Eval and T0-Eval overlap in almost all tasks (10 of the 11 T0-Eval tasks are in EAI-Eval), but their prompts always differ. EAI-Eval uses the hand-tuned prompts from Brown et al., which were optimized for GPT-3. T0-Eval, in contrast, uses prompts written through a community (crowdsourced) effort with its own primary goals, rather than tuned for a particular model. For this reason, results on EAI-Eval are better than on T0-Eval. The gap is most pronounced for causal decoder-only + FLM without multitask finetuning (the results in the “After self-supervised pretraining only” section), since that model is nearly identical in structure to GPT-3. The authors therefore apply the EAI-Eval prompts to the T0-Eval tasks as well. The results are shown in the figure above. On the tasks shared by EAI-Eval and T0-Eval, performance improves sharply. Compared with the gap before the prompts were swapped, this shows that the effect of the prompt is substantial. Meanwhile, on the tasks not in T0-Eval, the causal decoder's performance rises dramatically, with an especially large difference on LAMBADA.

Can models be adapted from one architecture/objective to another?

The experiments above show that multitask fine-tuning strongly affects zero-shot performance. Without multitask fine-tuning, decoder-only + FLM has the best zero-shot performance; after multitask fine-tuning, encoder-decoder + MLM is much better. This is an inconvenient result: a multitask fine-tuned encoder-decoder model is not well suited to open-ended generative tasks, while a multitask fine-tuned decoder-only model does not give the best results on many zero-shot tasks. The authors therefore run adaptation experiments.

Language modeling adaptation (LM-A)

A non-causal decoder-only + MLM model is adapted into a causal decoder + FLM model. The adaptation is simple: the architecture stays the same and only the attention mask changes. In the experiments, reaching the same validation loss takes 105B tokens instead of 168B, a 1.6x speedup.

Non-causal masked language modeling adaptation(NC-A)
The authors then introduce a new adaptation method, non-causal masked language modeling adaptation: a causal decoder-only + FLM model is adapted into a non-causal decoder-only + MLM model. This is the reverse of the language modeling adaptation (LM-A) above and is again implemented simply by changing the attention mask. The validation loss is shown on the right of Figure 6. The adapted model converges 3.3x to 9.1x faster than decoder-only models trained with MLM from scratch. With this adaptation, both a zero-shot model and an excellent generative model can be obtained for only 1.3x the cost of a single model.

Finally, the authors test whether the improvement in validation loss carries over to zero-shot performance. They find that the adapted non-causal + MLM model has better zero-shot performance than the original causal + FLM model. Three models are compared: (1) causal decoder + FLM with 219B tokens, before multitask fine-tuning; (2) causal decoder + FLM with 219B tokens, after multitask fine-tuning; and (3) causal decoder + FLM with 168B tokens, then MLM-adapted as a non-causal decoder for 51B tokens, after multitask fine-tuning. The experiment covers these three models, and multitask fine-tuning is carried out with 13B tokens.
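The speedup and cost figures can be reproduced from the token counts quoted in this section. Note that reading the 1.3x cost as (168B + 51B) / 168B is an inference from the numbers in this summary, not something the text states explicitly; the function names are made up for the example.

```python
def speedup(tokens_from_scratch, tokens_with_adaptation):
    """How many times fewer tokens are needed to reach the same validation loss."""
    return tokens_from_scratch / tokens_with_adaptation


def relative_cost(pretrain_tokens, adaptation_tokens):
    """Total token budget relative to pre-training a single model."""
    return (pretrain_tokens + adaptation_tokens) / pretrain_tokens


lm_adaptation_speedup = speedup(168, 105)       # billions of tokens
two_models_cost = relative_cost(168, 51)        # billions of tokens
```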

The results are shown above; the adaptation is clearly very effective.

[CVPR 2022 Tutorial] Denoising Diffusion-based Generative Modeling: Foundations and Applications(1)

Sat, 05 Nov 2022 10:12:00 +0000

[blog] [youtube]

Karsten Kreis¹, Ruiqi Gao², Arash Vahdat¹
¹ NVIDIA ² Google Brain

This post is based on the CVPR 2022 Tutorial: Denoising Diffusion-based Generative Modeling.

Deep Generative Learning

Learning to generate data

A generative model learns from a data distribution during training and then generates a sample at inference time. Generative models already perform very well in content generation, representation learning, and artistic tools.

Many generative models have been studied in computer vision, led by GANs (Generative Adversarial Networks) and extending to VAEs (Variational Autoencoders), energy-based models, autoregressive models, and normalizing flows, but the new and strong denoising diffusion models are expected to overtake them.

As the figure shows, recent denoising diffusion models generate very high-quality and diverse images even on challenging datasets such as ImageNet. The left side shows recent diffusion model results from OpenAI and the right side from Google. These results surpass GANs. Diffusion models already perform very strongly on super-resolution and text-to-image generation.

Denoising Diffusion Probabilistic Models

A denoising diffusion model consists of two processes.

(1) Forward diffusion process that gradually adds noise to input
(2) Reverse denoising process that learns to generate data by denoising

First, consider the forward process.

As in the figure above, noise drawn from a normal distribution is simply added over T steps. The noise schedule $\beta$ is set to a small value, on the order of 0.0001. The joint probability then follows as a Markov process.

Diffusion Kernel

Because the forward process is a Markov chain of simple Gaussian kernels, steps can be skipped. This shortcut, called the diffusion kernel, is given below. The noise schedule is designed so that the $\alpha$ value reaches 0 at the last step, leaving only white noise.
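To make the step-skipping concrete, here is a minimal NumPy sketch. It assumes the standard DDPM parameterization, $\bar{\alpha}_t = \prod_{s \le t}(1-\beta_s)$ and $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t) I)$; the linear schedule from 1e-4 to 0.02, the 1000 steps, and the function names are illustrative assumptions rather than values taken from the tutorial.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def diffuse(x0, t, alpha_bar, rng):
    """Sample x_t directly from x_0 in one shot, skipping the intermediate steps."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

alpha_bar = make_alpha_bar()
rng = np.random.default_rng(0)
x0 = np.ones((2, 3))                   # stand-in for a data sample
x_mid = diffuse(x0, 500, alpha_bar, rng)
x_last = diffuse(x0, 999, alpha_bar, rng)
print(alpha_bar[0], alpha_bar[-1])     # starts near 1 and decays toward 0
```

Because the cumulative product decays toward 0, the signal term vanishes at the last step and the sample is almost pure Gaussian noise.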

So far we have considered the conditional distribution $q(x_t | x_0)$. How, then, is the diffused data distribution $q(x_t)$ defined?

In the figure above, starting from the input data distribution at $x_0$, the diffused data distribution is smoothed by noise on the way to the final $x_T$. The diffusion kernel is therefore a Gaussian convolution that makes the distribution smoother and smoother as the steps proceed.

Generative Learning by Denoising

Now the reverse question: how can we sample from a standard normal distribution and obtain a value from the desired data distribution? Since we have the diffused distribution $q(x_t)$, we can repeatedly sample $x_{t-1}$ using the true denoising distribution. The problem is that this denoising distribution is intractable; in other words, we have no access to it, because the expression involves $q(x_{t-1})$, the distribution of the step we have not yet reached. What we must do instead is approximate it. The important condition here is that the noise schedule $\beta$ at each step must be very small.

Score-based Generative Modeling with Differential Equations

Advanced Techniques : Accelerated Sampling, Conditional Generation, and Beyond

Application (1) : Image Synthesis, Text-to-Image, Controllable Generation

Application (2) : Image Editing, Image-to-Image, Super-resolution, Segmentation

Application (3) : Video Synthesis, Medical imaging, 3D Generation, Discrete State Models

Conclusions, Open Problems

[ICML2022] Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts

Sat, 05 Nov 2022 10:12:00 +0000

[pdf] [github]

Yan Zeng¹, Xinsong Zhang¹, Hang Li¹
¹ByteDance AI Lab. Correspondence to: Yan Zeng.

Abstract

Existing vision-and-language methods rely on object-centric features obtained through object detection.
Such methods have difficulty learning the relations among multiple objects.
This paper addresses the problem through multi-granularity and achieves state-of-the-art results.

Introduction

Most current vision-language tasks achieve good performance by fine-tuning a pre-trained vision-language model (VLM). Most existing methods follow one of the two approaches (a) and (b) in the figure above. Approach (a) aligns fine-grained (object-centric) features with text, either by extracting objects in advance with an object detection model ([1], [2], [3]) or by running object detection on the fly ([4], [5]). Approach (b) does not use object detection and instead aligns coarse-grained (overall) image features with text ([6], [7], [8]). Both have drawbacks. In the fine-grained case, some detected objects are unrelated to the text, relations among multiple objects are hard to capture, and it is difficult to pre-define categories that suit every downstream task. In the coarse-grained case, object-level features cannot be learned, so performance is poor on downstream tasks such as visual reasoning, visual grounding, and image captioning. To take the advantages of both, this paper proposes multi-grained alignment that is not restricted to the object level or the image level.

As shown in the figure above, the paper builds training data in three forms for multi-grained vision-language pre-training: 1) an image caption that describes the whole image, 2) a region description such as “man wearing backpack” that describes a region, and 3) an object label such as “backpack” found by a detector. Through this reformulation of the training data, a single image-text pair becomes several images (regions) paired with several text labels, and in this paper the term “visual concept” covers objects, regions, and whole images alike.

The proposed X-VLM has a two-stream structure consisting of an image encoder, a text encoder, and a cross-modal fusion encoder. X-VLM is trained with two groups of losses: 1) a box regression loss and an IoU loss that locate the visual concept given its associated text, and 2) a contrastive loss, a matching loss, and a masked language modeling (MLM) loss that align the text with the visual concept.

X-VLM performs strongly on a range of downstream tasks. On image-text retrieval it outperforms VinVL by 4.65% (R@1 on MSCOCO) and also beats recent transformer-based models such as ALIGN, ALBEF, and METER. On visual reasoning it outperforms VinVL by 0.79% on VQA and 1.06% on NLVR2, and it even surpasses $SimVLM_{base}$, which was trained on 1.8B in-house examples. On visual grounding it improves on UNITER by 4.5% and on MDETR by 1.1%. On captioning it performs on par with $SimVLM_{base}$.

Method

X-VLM is a two-stream framework made up of an image encoder ($I_{trans}$), a text encoder ($T_{trans}$), and a cross-modal encoder ($X_{trans}$), all based on the transformer. The authors organize the pre-training data into images together with regions and objects that carry a bounding box and associated text, written $(I, T, \{(V^j, T^j)\}^N)$. In some cases there is no associated text, so $T$ is NaN, and in others there are no bounding boxes, so $N=0$. Each bounding box $b_j$ is normalized as (cx, cy, w, h). The bounding box of the whole image is $b$ = (0.5, 0.5, 1, 1).

Vision Encoding
The visual encoder is built to produce multi-grained visual concepts. Based on the vision transformer, it splits the image into non-overlapping patches and linearly embeds every patch. A 224 x 224 image is divided into 32 x 32 patches, giving 7 x 7 = 49 patches in total. Each patch $v_{p_i}$ carries its corresponding position information $p_i$. As on the left of the figure, after the vision transformer the patch features are reshaped, keeping their position information, into the form {$v_{p_1^j},…,v_{p_M^j}$}$\cdot${$p_1^j, …, p_M^j$} to form $V^j$. The average of these features, denoted $v_{cls}^j$, is then prepended. In this way the image encoder produces $N+1$ concept representations. $I_{trans}(V^0)$ is the image representation that uses all patch information.

Bounding Box Prediction
As mentioned, the model is trained over multiple granularities by locating the visual concept given its corresponding text while, at the same time, aligning the text with the visual concept. As in the bounding box stream of the figure, an MLP head is attached to the [CLS] token embedding of the cross-modal encoder to predict the box.

Bounding boxes are usually learned with an L1 loss, but because L1 is sensitive to scale, it is combined with an IoU loss to form a scale-invariant loss.
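The combination of an L1 term and an IoU term can be sketched as follows. This is a minimal illustration, assuming boxes in the normalized (cx, cy, w, h) format described earlier and a plain IoU term with equal weighting; the paper's exact IoU variant and loss weights are not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def iou(box_a, box_b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def box_loss(pred, target):
    """L1 distance on the coordinates plus (1 - IoU); equal weights are assumed."""
    l1 = float(np.abs(np.asarray(pred) - np.asarray(target)).sum())
    return l1 + (1.0 - iou(pred, target))

whole_image = (0.5, 0.5, 1.0, 1.0)
print(box_loss((0.5, 0.5, 0.5, 0.5), whole_image))
```

The L1 term grows with the absolute size of the box, while the IoU term depends only on relative overlap, which is what makes the combined loss less sensitive to scale.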

Contrastive Learning
For each given (visual concept, text) pair, an in-batch contrastive loss is built to train the cross-modal encoder. Under multi-granularity, the visual concept covers objects, regions, and images. The score function $s(V,T)$ is cosine similarity, computed from the respective [CLS] token embeddings.

The expressions above are the vision-to-text and text-to-vision similarities, where $\tau$ is a learnable temperature parameter. The contrastive loss is then formed as below.
H is the cross-entropy loss.
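A symmetric in-batch contrastive loss of this kind can be sketched in NumPy as follows. The sketch assumes that row i of each matrix is the [CLS] embedding of the i-th matched (visual concept, text) pair, and it uses a fixed temperature of 0.07 where the paper learns $\tau$; the function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def contrastive_loss(vision_cls, text_cls, tau=0.07):
    """Average of vision-to-text and text-to-vision cross-entropy over a batch."""
    v = vision_cls / np.linalg.norm(vision_cls, axis=1, keepdims=True)
    t = text_cls / np.linalg.norm(text_cls, axis=1, keepdims=True)
    sim = v @ t.T / tau                     # cosine similarity scaled by 1/tau
    diag = np.arange(sim.shape[0])          # matched pairs sit on the diagonal
    v2t = -log_softmax(sim)[diag, diag].mean()
    t2v = -log_softmax(sim.T)[diag, diag].mean()
    return 0.5 * (v2t + t2v)

rng = np.random.default_rng(0)
vision = rng.standard_normal((4, 16))
text = vision + 0.1 * rng.standard_normal((4, 16))  # roughly aligned pairs
print(contrastive_loss(vision, text))
```

Every other item in the batch serves as a negative, so the loss falls as matched pairs become more similar than mismatched ones.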

To help understand this loss, a figure of how the contrastive loss is built in the original CLIP model is attached.

Matching Prediction
A hard negative sample is drawn from the candidates used to build the contrastive loss above, and matching prediction is then performed.

Masked Language Modeling
Each word token is selected with 25% probability. A selected token is replaced with [MASK] 80% of the time, replaced with a random token 10% of the time, and left unchanged 10% of the time. A cross-entropy loss for MLM is then computed on the resulting masked sentence $\hat{T}$.
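The masking rule can be written as a short sketch. The 25% selection rate and the 80/10/10 split follow the description above; the function name, the toy vocabulary, and the use of `None` to mark unselected positions are assumptions made for illustration.

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.25, rng=None):
    """Return (masked_tokens, targets); targets is None where no loss is computed."""
    rng = rng or random.Random()
    masked, targets = [], []
    for token in tokens:
        if rng.random() >= select_prob:      # not selected: keep token, no target
            masked.append(token)
            targets.append(None)
            continue
        targets.append(token)                # selected: the model must predict it
        roll = rng.random()
        if roll < 0.8:
            masked.append("[MASK]")          # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab)) # 10%: replace with a random token
        else:
            masked.append(token)             # 10%: leave unchanged
    return masked, targets

sentence = "a man wearing a backpack walks past the station".split()
print(mask_tokens(sentence, vocab=["dog", "tree", "car"], rng=random.Random(0)))
```

The cross-entropy loss is computed only at positions whose target is not `None`.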

The final loss is as follows.

Experiment

Pre-training Datasets

Image-Text Retrieval

Visual Reasoning (VQA and NLVR2), Visual Grounding and Image Captioning

Grad-CAM visualization

Ablation Study

Conclusion

quoted from paper

We propose performing multi-grained vision language pre-training to handle the alignments between texts and visual concepts.
We propose to optimize the model (X-VLM) by locating visual concepts in the image given the associated texts and in the meantime aligning the texts with the visual concepts, where the alignments are in multigranularity.
We empirically verify that our approach effectively leverages the learned multi-grained alignments in finetuning. X-VLM consistently outperforms existing state-of-the-art methods on many downstream V+L tasks.

[ICML2022] Dialog Inpainting: Turning Documents into Dialogs

Tue, 01 Nov 2022 01:01:00 +0000

[pdf] [github]

Zhuyun Dai^{* 1}, Arun Tejasvi Chaganty^{* 1}, Vincent Zhao^{* 1}, Aida Amini¹, Qazi Mamunur Rashid¹, Mike Green¹, Kelvin Guu^{* 1}
^*Equal Contribution ¹Google Inc. Mountain View, USA.

Abstract

To address the scarcity of training data for ConvQA, the paper proposes dialog inpainting, a method that generates dialogs from documents.
With dialog inpainting, the authors generate two datasets, WikiDialog and WebDialog, that are 1,000 times larger than existing ConvQA datasets.
Training on the two generated datasets yields a ConvQA retrieval system that improves on the previous state-of-the-art model by as much as 40%.

Introduction

Recent information-seeking tools handle well-defined questions well (e.g. “Where was Barack Obama born?”), but they struggle with open-ended questions such as “How to eat healthier?”, which require conversational context and deeper understanding. The ConvQA task has been proposed and studied to address this. However, training data is expensive and hard to build because crowdsourcing is difficult and crowd workers lack domain knowledge. As a result, current ConvQA datasets are small, on the order of 10,000 examples.

Meanwhile, high-quality documents such as Wikipedia and PubMed are abundant. Such documents are often written by experts and are also easy to obtain through crowdsourcing. Building on this, the authors propose a method for creating dialogs from these high-quality documents. To turn a document into a dialog, the system asks questions and the writer answers; the writer's answers are already fixed, because they are simply phrases from the document. What remains is to make the system ask suitable questions, which resembles overhearing someone on the phone and inferring what the person on the other end is saying. The authors call this dialog inpainting; inpainting is the computer vision term for filling in masked or corrupted regions.

With this inpainter, the authors generate WikiDialog and WebDialog from Wikipedia and web data. The two datasets contain 19M+ dialogs, 1,000 times more than the largest existing ConvQA dataset. The authors then evaluate the generated datasets for conversationality and answer adequacy, and verify how well models trained on them perform in a ConvQA system.

In the experiments, the approach outperforms the previous state-of-the-art models by about 40% on ConvQA retrieval benchmarks (QReCC, OR-QuAC, TREC-CAsT), and in the zero-shot setting it reaches 95% of the fine-tuned performance.

Dialog Inpainting

Dialog inpainting starts by training an inpainter.

Notation

complete dialog $d$
$d=(u_1, u_2, …, u_t, …, u_T)$

unobserved utterances
$@$ symbol e.g. $(u_1, u_2, @, u_4, @)$

shorthand that denotes dialog $d$ with utterances 3 and 5 masked
$d_{m(3,5)}$

Inpaint( $d_{m(3,5)}$ ) = $(u_1, u_2, \hat{u_3}, u_4,\hat{u_5})$

Training: Dialog reconstruction

During inpainter training, one utterance is masked at random.

$ d_{m(t)} = (u_1, …, u_{t-1}, @, u_{t+1}, …, u_T)$

The model is then trained by maximum likelihood estimation (MLE), as in BERT. Where BERT masks a single token, here a single utterance is masked.

T5 is used as the inpainter. Because the input is a text with one utterance masked and the output is a single utterance, T5, a text-to-text transfer transformer, is an ideal choice.

Inference: Transforming documents into dialogs

The inpainter's job is to turn a document into a dialog. Given a document or passage $p$ made up of sentences $(s_1, s_2, …, s_m)$, the inpainter treats these sentences as answers and generates the questions. The desired result is therefore a dialog of the form $(@, s_1, @, s_2, @, s_3, …, @, s_m)$. To give the speaker the missing context as a hint, the authors prepend a prompt of the form “Hello, I am an automated assistant and can answer questions about (document title)”.

Finally, the partial dialog to be inpainted is given as input in the following form.

$ PartialDialog(p) = (s_{prompt}, @, s_1, @, …, @, s_m). $

However, during training only one utterance per dialog is masked with $@$, so inference cannot be run on this input directly. The authors therefore inpaint $(s_{prompt}, @, s_1)$ first, then use the inpainted utterance $\hat{u_1}$ to continue with $(s_{prompt}, \hat{u_1}, s_1, @, s_2)$. This is repeated until every mask is filled.
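The one-mask-at-a-time inference loop can be sketched as below. The `inpainter` argument is a hypothetical callable that stands in for the trained T5 model: it receives a partial dialog containing exactly one `@` and returns the utterance for that slot. `toy_inpainter` is a placeholder so that the sketch runs; it is not the authors' model.

```python
MASK = "@"

def inpaint_document(sentences, title, inpainter):
    """Turn document sentences into a dialog by filling one mask per step."""
    prompt = ("Hello, I am an automated assistant and can answer "
              f"questions about {title}")
    dialog = [prompt]
    for sentence in sentences:
        partial = dialog + [MASK, sentence]  # exactly one masked utterance
        question = inpainter(partial)        # predicted question for this slot
        dialog += [question, sentence]       # reuse it as context for the next step
    return dialog

def toy_inpainter(partial_dialog):
    """Placeholder model: builds a generic question from the answer that follows."""
    answer = partial_dialog[partial_dialog.index(MASK) + 1]
    return f"Can you tell me more about: {answer[:20]}?"

passage = ["Seoul is the capital of South Korea.", "It lies on the Han River."]
for utterance in inpaint_document(passage, "Seoul", toy_inpainter):
    print(utterance)
```

Each call sees the previously generated questions as context, which matches the training condition of a single masked utterance per dialog.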

Applying dialog inpainting to generate an information seeking dialog dataset

After training the inpainter, the authors generate datasets with the inference procedure above. The inpainter is trained on PublicDialog, TaskMaster, OR-QuAC, and QReCC. The first two are large but contain no explicit question answering, while the last two are small. The authors train on different subsets, giving three models: $Inpainter_{PT}$, trained only on the first two; $Inpainter_{QQ}$, trained on the last two; and $Inpainter_{PTQQ}$, trained on all four.
They then apply the inference procedure to 5.9M Wikipedia articles from the OR-QuAC retrieval corpus and 8.4M English web passages from the MS MARCO retrieval corpus. Depending on the inpainter used, the generated datasets are $WikiDialog_{PT}$, $WikiDialog_{QQ}$, $WikiDialog_{PTQQ}$, and $WebDialog_{PT}$.

Evaluating WikiDialog as a Dataset

The authors assess the generated datasets through human evaluation.

How information seeking are the generated utterances?

On whether the dialogs are information-seeking, raters gave $WikiDialog_{PT}$ a score of 94.5 and the other datasets scores of 99 to 100.

How well answered are the generated questions?

Raters judged answer adequacy, that is, whether each answer is appropriate for its question, and the table shows that the answers are sufficiently adequate.

How conversational are the data?

Raters also judged conversationality and found that the turns in the generated dialogs follow one another naturally.

What types of questions are generated?

Dialogs open with formulaic definitional questions (e.g. “what is”, “who is”, “where is”), but the follow-up utterances contain more diverse questions such as “did”, “is there”, and “how”.

Application : Open-domain Conversational Retrieval

Open-domain conversational QA consists of a retrieval part, which fetches the information relevant to the question, and a generator, which produces the next utterance from the retrieved passages and the dialog. This paper leaves the generator as future work and focuses its experiments on the retriever.

As in the figure above, a two-stage ConvQA retrieval system is used. First, a dual-encoder retriever embeds the dialog history and the passages separately and extracts the top-K passages; then a reranker based on a cross-attention model rescores them to produce the final retrieval.
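The retrieve-then-rerank flow can be sketched with NumPy as follows. The embeddings are assumed to come from an already trained dual encoder, and `rerank_score` is a hypothetical callable standing in for the cross-attention reranker; it maps a passage index to a relevance score.

```python
import numpy as np

def retrieve_top_k(query_vec, passage_vecs, k):
    """Stage 1: dual-encoder retrieval by inner product between embeddings."""
    scores = passage_vecs @ query_vec
    return np.argsort(-scores, kind="stable")[:k].tolist()

def retrieve_then_rerank(query_vec, passage_vecs, k, rerank_score):
    """Stage 2: rescore only the top-k candidates with a more expensive model."""
    candidates = retrieve_top_k(query_vec, passage_vecs, k)
    return sorted(candidates, key=rerank_score, reverse=True)

passages = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.0])        # embedding of the dialog history
fake_reranker = {0: 0.2, 1: 0.9}    # placeholder scores, not model output
print(retrieve_then_rerank(query, passages, k=2, rerank_score=fake_reranker.get))
```

The cheap first stage narrows millions of passages down to K candidates so that the slower cross-attention model only has to score a handful.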

When pre-training on the generated WikiDialog and WebDialog, the label passage to retrieve is the document that the answer sentences originally came from, so using it unchanged would likely teach the model simple string matching. The authors therefore remove the answer sentence that appears in the dialog from the label passage, i.e. the document that originally contained it, and train the model to retrieve this reduced passage. In the fine-tuning stage, the model is fine-tuned on the downstream ConvQA datasets.

Evaluation

Dataset, Baseline model, and Metric

Task : ConvQA Retrieval System

Dataset : OR-QuAC, QReCC, TREC CAsT19, and CAsT20

Baseline : BM25-Query Rewriter, BM25-T5QR, ANCE-Query Rewriter, CONQRR, and ConvDR

Metric : MRR@5, MRR
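For reference, mean reciprocal rank can be computed as in the sketch below. It assumes one gold passage per query and treats a gold passage that falls outside the top k as contributing 0; the function name and signature are illustrative.

```python
def mrr_at_k(ranked_lists, gold_ids, k=None):
    """Mean reciprocal rank of the gold passage; k=None means no cutoff (plain MRR)."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_ids):
        top = ranked if k is None else ranked[:k]
        if gold in top:
            total += 1.0 / (top.index(gold) + 1)  # ranks are 1-based
    return total / len(gold_ids)

rankings = [["p1", "p2"], ["p3", "p4", "p5"]]
gold = ["p2", "p5"]
print(mrr_at_k(rankings, gold))       # gold passages at ranks 2 and 3
print(mrr_at_k(rankings, gold, k=2))  # the second gold passage falls outside the top 2
```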

Experiment Results

The model surpasses the previous state-of-the-art models by a wide margin, especially when reranking is also used.

Retriever performance when T5-Base DE

The WikiD-PT model, whose inpainter was trained only on dialog data without question answering, already beats the previous state of the art, and performance improves further when the QQ data is used and when all of the data is used.

Zero-shot/few-shot results

With the WikiDialog and WebDialog data produced by the inpainter, the model reaches as much as 95% of the fine-tuned performance on QReCC even in the zero-shot setting. This indicates that the proposed method learns a strong representation for ConvQA.

[ICML2022] VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix

Mon, 31 Oct 2022 08:44:00 +0000

[pdf] [github]

Teng Wang^{1 2}, Wenhao Jiang³, Zhichao Lu¹, Feng Zheng¹, Ran Cheng¹, Chengguo Yin³, Ping Luo²
¹Department of Computer Science and Engineering, Southern University of Science and Technology ²Department of Computer Science, The University of Hong Kong ³Data Platform, Tencent

Abstract

Existing vision-and-language pre-training (VLP) methods depend on paired image-text datasets, which are hard to curate and require a great deal of human labor.
This paper introduces a way to learn from unpaired data, drawn from large-scale text-only and image-only corpora, through an augmentation called cross-modal CutMix (CMC). As in the figure above, the method replaces visually-grounded words in a natural-language sentence with image patches to form a multi-modal sentence. CMC augmentation overcomes the scarcity of aligned pairs and improves token-level denoising.
In addition, the paper introduces VLMixer, a new contrastive learning method.

Introduction

Many current vision-and-language pre-training (VLP) approaches are trained with manually labeled, well-aligned datasets such as MSCOCO and Visual Genome together with high-capacity transformer models. In training these transformers, (1) at the global level, alignment is learned through the image-text matching loss used as a pre-training loss, and (2) at the instance level, the self-attention layers learn fine-grained interactions between the input tokens of the two modalities. However, model performance on these well-aligned datasets has already saturated, and there have been attempts to scale models up by training on weakly aligned pairs instead. One unsupervised VLP method (U-Visual BERT) learns multi-modal representations from stand-alone image and text corpora.

However, the authors point out that this prior work uses image tags as the intermediate representation that bridges the two modalities, which is not appropriate for complex images. Such methods are also weak on downstream tasks that depend on fine-grained alignment, such as NLVR and image-text retrieval.

This work addresses the problem by generating “multi-modal sentences” with cross-modal CutMix (CMC). As in the figure, visually-grounded words in a natural-language sentence are replaced with patches from an image patch gallery and the result is fed to a multimodal transformer, so that token-level alignment can be learned with the usual mask-then-predict objective. The paper additionally proposes a contrastive learning framework for effective instance-level alignment between the two modalities. It treats a multimodal sentence and its corresponding text sentence as semantically corresponding and pulls them together, while pushing negative samples away. This makes instance-level image-text alignment learning effective.

Existing methods correspond to A, B, and C in the figure above. The vision-and-language training methods in A and B require image-text pairs. Unlike A, which handles multimodal input in a plain (vanilla) way, the Oscar-style method in B uses tag anchors. U-Visual BERT in C can be trained on unpaired sets rather than text-image pairs. However, the authors point out that because U-Visual BERT uses only image tags as the text side, it cannot capture the interaction between visual regions and linguistic cues, and it cannot learn alignment when there is no explicit matching supervision (tags). The proposed VLMixer solves the first problem through patch tags and handles the second, tag-free case with a contrastive loss.

VLMixer Pre-training

VLMixer has two parallel pre-training branches: Visually-Aided Language Pre-training (VALP) and Tag-Aided Visual Pre-training (TAVP). VALP uses cross-modal CutMix (CMC). TAVP works on image-only datasets: given only an image, it uses the image tags as the text modality and trains in the same way as U-Visual BERT.

Patch gallery

A visual patch gallery is built from an image-only dataset with an off-the-shelf concept (patch) detector (e.g. Faster R-CNN). Here w is the concept label and c is the confidence score. The “contextual concepts” around each concept are stored as well. The gallery is built from each i-th concept and its j-th contextual concepts as in the following expression.

CutMix visual patches into sentence

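In CMC, each word token is replaced with a patch x_q with q ~ Norm({P_i}). In the expression, G_i denotes the “contextual concepts” of the i-th concept. Read in words: for a word token in the natural-language sentence, each matching concept in the patch gallery is assigned a probability that takes its surrounding contextual concepts into account (for diversity), the probabilities are normalized into a distribution q, and a patch x_q is drawn from that distribution. Whether the word token is actually replaced by the patch x_q is then decided with probability r_cmc.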
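The replacement step can be sketched as follows. The scoring formula for the gallery entries is not reproduced in this post, so the sketch simply assumes each gallery entry already carries an unnormalized weight P_i; the gallery layout (a dict from concept word to a list of (patch id, weight) pairs) and the function name are hypothetical.

```python
import random

def cross_modal_cutmix(tokens, gallery, r_cmc=0.5, rng=None):
    """Replace visually-grounded words with sampled patch ids with probability r_cmc."""
    rng = rng or random.Random()
    mixed = []
    for token in tokens:
        candidates = gallery.get(token)
        if candidates and rng.random() < r_cmc:
            patches = [patch for patch, _ in candidates]
            weights = [weight for _, weight in candidates]
            # random.choices normalizes the weights, playing the role of Norm({P_i}).
            mixed.append(rng.choices(patches, weights=weights, k=1)[0])
        else:
            mixed.append(token)  # words without gallery entries stay as text
    return mixed

gallery = {"dog": [("patch_dog_1", 0.9), ("patch_dog_2", 0.1)]}
print(cross_modal_cutmix(["a", "dog", "runs"], gallery, r_cmc=1.0,
                         rng=random.Random(0)))
```

In a real pipeline the patch ids would be swapped for region features before the sentence is fed to the transformer.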

K-shot CMC.

To increase diversity, the replacement with probability r_cmc is repeated K times, so that K concept patches are used. The final sentence of multimodal tokens is as follows.

Visually-Aided Language Pre-training (VALP)

The backbone is the Transformer of Vaswani et al., trained with masked language modeling (MLM) and a cross-modal contrastive loss (CMCL).

Masked language modeling (MLM).

MLM follows the original BERT: tokens are masked with 15% probability.

Cross-modal contrastive learning (CMCL).

To build a contrastive loss for unpaired VLP, a multimodal sentence S_M and the original natural-language sentence T_M it was derived from are treated as a positive pair, and non-matching pairs are treated as negatives, giving the contrastive loss below. f is the cosine similarity of the [CLS] tokens.

Tag-Aided Visual Pre-training (TAVP)

TAVP is used to extract multi-modal knowledge from visual-only data. On image-only datasets, given only an image, it uses the image tags as the text modality and trains in the same way as U-Visual BERT. As in Oscar, the loss comes from mask-then-predict pre-training with 15% masking probability.

The final loss is as follows.

Experiments

For a fair comparison in the unpaired vision-and-language setting, performance is validated on paired datasets with the alignment information removed. The pre-training datasets are listed below.

Comparison with State-of-the-Art Methods

Ablation Studies on pre-training objectives

Ablation of Cross-modal CutMix

Ablation study of the contrastive learning methods and data augmentations All models are pre-trained on COCO.

Downstream performance using different number of concepts in the patch gallery

Conclusion

*quoted from the paper

We propose cross-modal CutMix to construct a multimodal representation to bridge the images and texts, guiding the model to learn cross-modal alignment at the token level.
We propose cross-modal contrastive learning upon CMC to facilitate instance-level alignments between unpaired images and texts, where semantically similar instances are pulled closer and dissimilar instances are pushed away.
Extensive experiments on diverse downstream tasks show that our approach achieves superior performance over previous unpaired VLP methods.

[BEIT-3] Image as a Foreign Language: BEIT Pretraining for All Vision and Vision-Language Tasks

Tue, 06 Sep 2022 01:27:00 +0000

[pdf] [github]

Wenhui Wang, Hangbo Bao, Li Dong∗, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei
Microsoft Corporation

Abstract

The paper introduces BEIT-3, a general-purpose model that achieves state-of-the-art results on vision tasks and vision-and-language tasks.
The modular architecture of the Multiway Transformer, introduced for general-purpose modeling, enables both deep fusion and modality-specific encoding.
By pretraining in a unified way, applying masked “language” modeling to images (Imglish), texts, and image-text pairs, the model achieves state-of-the-art results on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).

Introduction : The Big Convergence

Recently, powerful Transformer models in language (BERT), vision (BEIT, BEITv2), and multimodal learning (VLMO, CLIP, CoCa) have set the trend in their respective fields.

Within vision-and-language, there are three converging pretraining trends.

First, the success of Transformer models is spreading from language to vision and on to multimodal learning. In vision-and-language, however, the Transformer model differs by downstream task: the end-task format has to be manually adapted to the architecture, and parameters are not shared well across modalities. This paper therefore adopts Multiway Transformers (VLMO) to propose a general-purpose model in which a single unified model solves diverse downstream tasks.

Second, masked modeling has proven successful across several modalities. Yet in addition to masked modeling, existing vision-and-language transformers are pretrained on multiple tasks such as image-text matching, and such multitask pretraining is not well suited to scaling up. This paper therefore unifies pretraining under the simple mask-then-predict approach: the image is treated as a foreign language, Imglish, and only a BERT-style MLM (masked language modeling) objective is used.

Third, increasing model size and data size helps generalization quality. Following this, the paper scales up to billions of parameters and achieves state-of-the-art results by a large margin using only public data, without private in-house data.

As shown above, the paper adopts the Multiway Transformer model and, as mentioned, uses only the self-supervised objective of masking text tokens and image patches and then predicting them. As the first figure and table show, the proposed BEIT-3 model achieves state-of-the-art results on many vision tasks and vision-and-language tasks.

BEIT-3 : A General-Purpose Multimodal Foundation Model

Backbone Network : Multiway Transformers

Multiway Transformers (VLMO) are used as the backbone architecture. As the figure shows, shared self-attention handles alignment and deep fusion across modalities, after which a modality-specific expert network is applied. This work uses 3-way transformers with vision, text, and vision-and-text experts. On top of this backbone, the BEIT-3 model is configured for each downstream task as in the figure below.
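The idea of a shared attention layer followed by modality-specific experts can be sketched in NumPy. This is a heavily simplified illustration, not the BEIT-3 implementation: it assumes single-head attention, no layer normalization, ReLU feed-forward experts, and routing of the whole sequence to one expert chosen by a `modality` key. All names and shapes are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def shared_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention whose weights are shared by every modality."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def multiway_block(x, modality, attn_weights, experts):
    """Shared attention, then the feed-forward expert chosen by modality."""
    h = x + shared_self_attention(x, *attn_weights)  # residual connection
    w_in, w_out = experts[modality]                  # modality-specific FFN
    return h + np.maximum(h @ w_in, 0.0) @ w_out     # ReLU FFN with residual

rng = np.random.default_rng(0)
dim, hidden = 8, 16
attn_weights = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]
experts = {name: (rng.standard_normal((dim, hidden)) * 0.1,
                  rng.standard_normal((hidden, dim)) * 0.1)
           for name in ("vision", "text", "vision-and-text")}
tokens = rng.standard_normal((5, dim))
print(multiway_block(tokens, "vision", attn_weights, experts).shape)
```

The same attention weights process every input, while only the feed-forward expert changes with the modality, which is what lets one backbone serve vision, text, and vision-and-text tasks.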

Pretraining Task : Masked Data Modeling

Masked data modeling (VL-BEIT) is used as the pretraining task. As in BERT, word tokens and image patches are masked and then predicted, and this unified mask-then-predict approach helps the model learn alignment between modalities. Using only this single pretraining task also makes the approach friendly to scaling up. Existing vision-and-language models use multiple pretraining tasks, which is unfavorable for scaling up the training process, whereas the authors confirm that with mask-then-predict alone, training works well even with a smaller batch size. The model specification and pretraining data are given below.

Experiments on Vision and Vision-and-Language Tasks

As noted earlier, BEIT-3 achieves state-of-the-art results on a range of vision tasks and vision-and-language tasks.

(1) Vision-and-Language Downstream Tasks
Visual Question Answering(VQA) / Visual Reasoning / Image Captioning

Image-Text Retrieval / Zero-shot Image-Text Retrieval

(2) Vision Downstream Tasks
Object Detection and Instance Segmentation

Semantic Segmentation

Image Classification

Discussion & Comments

Judging from the conclusion, the authors appear to be preparing future work that extends BEIT-3 to multiple languages and to the audio modality. ( “For future work, we are working on pretraining multilingual BEIT-3 and including more modalities (e.g., audio) in BEIT-3 to facilitate the cross-lingual and cross-modality transfer, and advance the big convergence of large-scale pretraining across tasks, languages, and modalities”) Simple yet powerful Transformer models are expanding into vision-and-language, and it feels as though the methodologies of very large language models are being applied to vision-and-language tasks one after another.
