Ankita Bhaumik^†, Praveen Venkateswaran^∗, Yara Rizk^∗, Vatche Isahagian^∗
^† Rensselaer Polytechnic Institute, Troy, New York ^∗ IBM Research

Abstract

(TOD metrics) 많은 similarity metric 들이 제안되었 지만, task-oriented conversation 의 unique 한 특성을 알아내는 metric 에 대한 연구는 많이 진행되지 않았다.
( TaskDiff ) 이에 저자들은 TaskDiff 라는 conversational similarity metric 을 제안한다. TaskDiff는 utterances, intents, slots 와 같은 다양한 dialogue component 를 활용하여 그 distribution 을 통해 optimal transport 를 활용하여 계산된다.
(Experiments) 다양한 벤치마크에서 TaskDiff 가 superior performance 와 robustness 를 보인다.

1. Introduction

▶ A key aspect of conversational analytics
대화(conversation)에 대한 연구는 LLM 의 등장으로 가속화되었고, ChatGPT 나 LLaMA2 등을 활용한 assistant 들도 많이 등장하였다. 이것들을 통해 user-experience 가 develop 될 수 있다. 그러나, 여러 assistant 간에 누가 더 나은지를 측정하는 metric 은 연구가 충분히 이뤄지지 않는다.

▶ Textual similarity of Dialogue
Document 나 social media, transcript 등의 textual source 에 대한 similarity 측정은 이미 많은 연구가 이뤄졌고, 꽤 좋은 성능을 보여주고 있다. 이러한 것 연구에는 Word2Vec, GloV2, Universal Sentence Encoder 등의 연국 ㅏ포함된다.

그러나, task-oriented conversation 은 기존의 metric 들의 적용에 여러 challenge 가 존재한다. 우선, TOD 는 distinct component (e.g. intents, slots, utterances) 를 포함하고 있어, similiarty 와 overlap 에 impact 가 될 수 있다. 예를 들면, user 둘은 다른 objective (e.g. booking travel vs. product returns) 를 가지고 있을 수 있지만, 같은 intent 를 가지고 있을 수 있고, 다른 slot info 를 원할 ㅅ수 있다. 두 번째로, information 이 multiple conversation turn에 걸쳐 제공된다는 점이 metric 으로의 어려움을 증가시킨다. 마지막으로, 같은 task 의 set 들도 여러 user utterance 들로 표현이 될 수 있으며, 이것들은 choice of phrasing, order of sentences, use of colloquialism 등을 포함할 수 있다.

따라서, distance based similairty of utterance embedding 에 의존하는 것은 매우 나쁜 성능을 미친다.

▶ TaskDiff
이에 저자들은 TaskDiff 라는 novel similarity metric designed for TOD 를 제안한다. 위의 그림처럼, 여러 user 들은 같은 대화를 하지만, re-ordered task 와 paraphrased utterance 등을 통해 바뀔 수 있는데, 기존의 방법들은 (SBERT, ConvED, HOTT 등)은 틀리거나 robust 하지 않은 것을 볼 수 있다.

Ideal Metric to measure conversational similarity 는 conversation 의 overall goal 을 반드시 맞춰야 한다는 것이다. 위의 그림 역시 overall goal 은 세 대화 모두 동일하다. TaskDiff 는 converation 을 distribution 으로 표현한 다음, optimal transport 와 결합하여 similarity 를 측정한다. 여러가지 benchmark 에 taskdiff 를 측정한 결과 높은 performance 와 강한 robustness 를 보인다.

2. Task-Oriented Conversation Similarity

2.1. Definitions

Pre-defined user intents $I$
corresponding slots or parameteres $S$
Conversation $C$
multi-turn sequence of utterances $U$
Overall component of task-oriented conversations $K=[U,I,S]$

2.2. Approach

TaskDiff 는 component-wise distribution 의 distance 로 정의된다. 각각의 component $k \in K$ 에 대하여, 이 것을 distribution 으로 나타낸 이후, cumulative cost of trasnforming or transporting the component-wise distrubution 을 통해, 즉 optimal transport 를 통해 distance 를 계산한다.

Figure 2 에서 Overview를 볼 수 있다. 우선, 으로 slot value 들을 mask 해준다. 이는 unrelated utterance 사이의 lexcial similarity 에 의한 방해를 방지하기 위함이다. 예를 들어, "I want a ticket to the BIG APPLE"과 "I want a ticket to the APPLE CONFERENCE" 는 다른 내용을 담고 있지만, APPLE 이라는 단어 때문에 lexcial similarity 가 높을 수 있다. 이 것을 각각 와 으로 masking 해주면 이런 것을 방지할 수 있다.

이후, SBERT 를 이용하여 conversational embedding 을 얻는다. 그리고 Intent distribution 과 Slot Distribution 까지 얻은 뒤, 이 component 들의 distribution 을 활용하여 converation 을 표현한다.

Distance 는 두 conversation 의 distribution 에 대해 cost Matrix 를 활용한 1-Wassestein opritmal transport distance 를 활용한다.

ㅁ 자세한 notation 은 논문참조

3. Experiemntal Evaluation

3.1. Dataset

SGD / 20 domain / 20,000 conversations

3.2. Baselines

SBERT : cos-sim based similarity metric
Conversational Edit Distance (ConvED)
Hierarchical Optimal Transport (HOTT) : Latent Dirichlet Allocation (LDA)-based similarity metric

3.3 k-NN Classification

baseline metric 들과 TaskDiff metric 들에 대한 비교는 k-NN Classification 으로 진행한다. 비슷한 SGD conversation 들에 대해, k-NN classifiation 을 통해 잘 분류하는지 살펴본다. 결과는 아래와 같다.

TaskDiff 가 압도적으로 잘 similarity 를 표현하는 것을 볼 수 있다.

3.4. Conversational Clusters

k-means clustering 을 통해 SGD 를 표현하였을 때, TaskDiff 가 가장 well-formed and distince cluster 를 보이는 것을 볼 수 있다.

3.5. Robusteness to Reordering

Converational reordering 에도 Distance 가 증가하지 않아, 강한 robustness 를 보이는 것을 알 수 있다.

Conclusion

In this paper we present TaskDiff, a novel metric to measure the similarity between task-oriented conversations. It not only captures semantic similarity between the utterances but also utilizes dialog specific features like intents and slots to identify the overall objective of the conversations. We demonstrate that unlike existing metrics, taking advantage of these unique components is critical and results in significantly improved performance. As part of future work, we will investigate the inclusion of additional dialog features on open domain dialog datasets and the utilization of TaskDiff to improve the performance of various downstream conversational tasks.

Limitations

We demonstrate in this work that TaskDiff is a superior and more robust similarity metric compared to existing state-of-the-art approaches for task-oriented conversations. Given the use of optimal transport to compute similarity as a function of differences over the component distributions (intents, slots, and utterances), TaskDiff is reliant on being given an ontology for the intents and slots present across the conversations. However, this is a fair assumption to make for the domain of task-oriented conversations, and such ontologies are leveraged by real-world deployments such as Google DialogFlow, IBM Watson Assistant, Amazon Lex, etc.

Yongil Kim

[EMNLP2023] TaskDiff: A Similarity Metric for Task-Oriented Conversations