Crowdsourcing Ecologically-Valid Dialogue Data for German


Bibliographic Details
Main Authors: Yannick Frommherz, Alessandra Zarcone
Format: Article
Language: English
Published: Frontiers Media S.A., 2021-06-01
Series: Frontiers in Computer Science
Subjects: dialogue data; voice assistants; crowdsourcing; Wizard-of-Oz; German; ecological validity
Online Access: https://www.frontiersin.org/articles/10.3389/fcomp.2021.686050/full
Author Affiliations: Yannick Frommherz, Audio and Media Technologies, Semantic Audio Processing, Fraunhofer Institute for Integrated Circuits IIS, Erlangen, Germany; Alessandra Zarcone, Audio and Media Technologies, HumAIn, Fraunhofer Institute for Integrated Circuits IIS, Erlangen, Germany
DOI: 10.3389/fcomp.2021.686050

Description: Despite their increasing success, user interactions with smart speech assistants (SAs) are still very limited compared to human-human dialogue. One way to make SA interactions more natural is to train the underlying natural language processing modules on data which reflects how humans would talk to a SA if it were capable of understanding and producing natural dialogue given a specific task. Such data can be collected by applying a Wizard-of-Oz (WOz) approach, where both the user and the system side are played by humans. WOz allows researchers to simulate human-machine interaction while benefiting from the fact that all participants are human and thus dialogue-competent. More recent approaches have leveraged simple templates specifying a dialogue scenario for crowdsourcing large-scale datasets. Template-based collection efforts, however, come at the cost of data diversity and naturalness.

We present a method to crowdsource dialogue data for the SA domain in the WOz framework, which aims at limiting researcher-induced bias in the data while still allowing for low-resource, scalable data collection. Our method can also be applied to languages other than English (in our case German), for which fewer crowd-workers may be available. We collected data asynchronously, relying only on existing functionalities of Amazon Mechanical Turk, by formulating the task as a dialogue continuation task. Coherence in dialogues is ensured, as crowd-workers always read the dialogue history and as a unifying scenario is provided for each dialogue. In order to limit bias in the data, rather than using template-based scenarios, we handcrafted situated scenarios which aimed at not pre-scripting the task in every single detail and not priming the participants' lexical choices. Our scenarios cued people's knowledge of common situations and entities relevant for our task without directly mentioning them, relying instead on vague language and circumlocutions. We compare our data (which we publish as the CROWDSS corpus; n = 113 dialogues) with data from MultiWOZ, showing that our scenario approach led to considerably less scripting and priming and thus more ecologically-valid dialogue data. This suggests that small investments in the collection setup can go a long way in improving data quality, even in a low-resource setup.
Collection: DOAJ
ISSN: 2624-9898