Crowdsourcing Ecologically-Valid Dialogue Data for German


Bibliographic Details
Main Authors: Yannick Frommherz, Alessandra Zarcone
Format: Article
Language: English
Published: Frontiers Media S.A., 2021-06-01
Series: Frontiers in Computer Science
Subjects: dialogue data; voice assistants; crowdsourcing; Wizard-of-Oz; German; ecological validity
Online Access: https://www.frontiersin.org/articles/10.3389/fcomp.2021.686050/full
Author Affiliations: Yannick Frommherz, Audio and Media Technologies, Semantic Audio Processing, Fraunhofer Institute for Integrated Circuits IIS, Erlangen, Germany; Alessandra Zarcone, Audio and Media Technologies, HumAIn, Fraunhofer Institute for Integrated Circuits IIS, Erlangen, Germany
DOI: 10.3389/fcomp.2021.686050

Description: Despite their increasing success, user interactions with smart speech assistants (SAs) are still very limited compared to human-human dialogue. One way to make SA interactions more natural is to train the underlying natural language processing modules on data which reflects how humans would talk to a SA if it were capable of understanding and producing natural dialogue given a specific task. Such data can be collected by applying a Wizard-of-Oz (WOz) approach, where both the user and the system side are played by humans. WOz allows researchers to simulate human-machine interaction while benefiting from the fact that all participants are human and thus dialogue-competent. More recent approaches have leveraged simple templates specifying a dialogue scenario for crowdsourcing large-scale datasets. Template-based collection efforts, however, come at the cost of data diversity and naturalness.

We present a method to crowdsource dialogue data for the SA domain in the WOz framework, which aims at limiting researcher-induced bias in the data while still allowing for low-resource, scalable data collection. Our method can also be applied to languages other than English (in our case German), for which fewer crowd-workers may be available. We collected data asynchronously, relying only on existing functionalities of Amazon Mechanical Turk, by formulating the task as a dialogue continuation task. Coherence in dialogues is ensured, as crowd-workers always read the dialogue history and as a unifying scenario is provided for each dialogue. In order to limit bias in the data, rather than using template-based scenarios, we handcrafted situated scenarios which aimed at not pre-scripting the task in every single detail and not priming the participants' lexical choices. Our scenarios cued people's knowledge of common situations and entities relevant for our task without directly mentioning them, relying instead on vague language and circumlocutions. We compare our data (which we publish as the CROWDSS corpus; n = 113 dialogues) with data from MultiWOZ, showing that our scenario approach led to considerably less scripting and priming and thus more ecologically-valid dialogue data. This suggests that small investments in the collection setup can go a long way in improving data quality, even in a low-resource setup.
Collection: DOAJ
ISSN: 2624-9898