A Reproducible IT-Blog Corpus

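The dataset comprises text and metadata extracted from several hundred IT-blogs and websites, along with a method to duplicate the data by updating its contents and downloading it to the user’s local machine. The targets have been hand-picked with the intention to represent the discourse on blogs and websites dedicated to questions at the intersection of technology and society from Germany and the United States of America. The texts have been retrieved by web crawling techniques. The resulting corpus is accessible through a search platform and also reproducible with freely accessible descriptors and software.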

Bibliographic Details
Main Authors:	Adrien Barbaresi, Jens Pohlmann
Format:	Article
Language:	English
Published:	Ubiquity Press 2021-07-01
Series:	Journal of Open Humanities Data
Subjects:	web blogs; corpus linguistics; internet policy; discourse analysis; public discussion; freedom of expression
Online Access:	https://openhumanitiesdata.metajnl.com/articles/35

id	doaj-f0183d32d0ac4f04adbb6b78d2982a02
record_format	Article
spelling	doaj-f0183d32d0ac4f04adbb6b78d2982a02 | 2021-08-11T08:05:52Z | eng | Ubiquity Press | Journal of Open Humanities Data | 2059-481X | 2021-07-01 | 710.5334/johd.3536 | A Reproducible IT-Blog Corpus | Adrien Barbaresi0 | Jens Pohlmann1 | Center for Digital Lexicography of German, BBAW, Berlin | Centre for Media, Communication & Information Research (ZeMKI), University of Bremen, Bremen, DE; Center for Spatial and Textual Analysis (CESTA), Stanford University, Stanford | The dataset comprises text and metadata extracted from several hundred IT-blogs and websites, along with a method to duplicate the data by updating its contents and downloading it to the user’s local machine. The targets have been hand-picked with the intention to represent the discourse on blogs and websites dedicated to questions at the intersection of technology and society from Germany and the United States of America. The texts have been retrieved by web crawling techniques. The resulting corpus is accessible through a search platform and also reproducible with freely accessible descriptors and software. | https://openhumanitiesdata.metajnl.com/articles/35 | web blogs | corpus linguistics | internet policy | discourse analysis | public discussion | freedom of expression
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Adrien Barbaresi Jens Pohlmann
spellingShingle	Adrien Barbaresi Jens Pohlmann A Reproducible IT-Blog Corpus Journal of Open Humanities Data web blogs corpus linguistics internet policy discourse analysis public discussion freedom of expression
author_facet	Adrien Barbaresi Jens Pohlmann
author_sort	Adrien Barbaresi
title	A Reproducible IT-Blog Corpus
title_short	A Reproducible IT-Blog Corpus
title_full	A Reproducible IT-Blog Corpus
title_fullStr	A Reproducible IT-Blog Corpus
title_full_unstemmed	A Reproducible IT-Blog Corpus
title_sort	reproducible it-blog corpus
publisher	Ubiquity Press
series	Journal of Open Humanities Data
issn	2059-481X
publishDate	2021-07-01
description	The dataset comprises text and metadata extracted from several hundred IT-blogs and websites, along with a method to duplicate the data by updating its contents and downloading it to the user’s local machine. The targets have been hand-picked with the intention to represent the discourse on blogs and websites dedicated to questions at the intersection of technology and society from Germany and the United States of America. The texts have been retrieved by web crawling techniques. The resulting corpus is accessible through a search platform and also reproducible with freely accessible descriptors and software.
topic	web blogs; corpus linguistics; internet policy; discourse analysis; public discussion; freedom of expression
url	https://openhumanitiesdata.metajnl.com/articles/35
work_keys_str_mv	AT adrienbarbaresi areproducibleitblogcorpus AT jenspohlmann areproducibleitblogcorpus AT adrienbarbaresi reproducibleitblogcorpus AT jenspohlmann reproducibleitblogcorpus
_version_	1721211548820570112

Similar Items