A Reproducible IT-Blog Corpus

The dataset comprises text and metadata extracted from several hundred IT-blogs and websites, along with a method to duplicate the data by updating its contents and downloading it to the user’s local machine. The targets have been hand-picked with the intention to represent the discourse on blogs an...

Bibliographic Details
Main Authors: Adrien Barbaresi, Jens Pohlmann
Format: Article
Language: English
Published: Ubiquity Press 2021-07-01
Series: Journal of Open Humanities Data
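Subjects: web blogs; corpus linguistics; internet policy; discourse analysis; public discussion; freedom of expression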
Online Access: https://openhumanitiesdata.metajnl.com/articles/35
id doaj-f0183d32d0ac4f04adbb6b78d2982a02
record_format Article
spelling doaj-f0183d32d0ac4f04adbb6b78d2982a02 | 2021-08-11T08:05:52Z | eng | Ubiquity Press | Journal of Open Humanities Data | 2059-481X | 2021-07-01 | 7 | 10.5334/johd.35 | 36 | A Reproducible IT-Blog Corpus | Adrien Barbaresi (0) | Jens Pohlmann (1) | (0) Center for Digital Lexicography of German, BBAW, Berlin | (1) Centre for Media, Communication & Information Research (ZeMKI), University of Bremen, Bremen, DE; Center for Spatial and Textual Analysis (CESTA), Stanford University, Stanford | The dataset comprises text and metadata extracted from several hundred IT-blogs and websites, along with a method to duplicate the data by updating its contents and downloading it to the user’s local machine. The targets have been hand-picked with the intention to represent the discourse on blogs and websites dedicated to questions at the intersection of technology and society from Germany and the United States of America. The texts have been retrieved by web crawling techniques. The resulting corpus is accessible through a search platform and also reproducible with freely accessible descriptors and software. | https://openhumanitiesdata.metajnl.com/articles/35 | web blogs | corpus linguistics | internet policy | discourse analysis | public discussion | freedom of expression
collection DOAJ
language English
format Article
sources DOAJ
author Adrien Barbaresi
Jens Pohlmann
spellingShingle Adrien Barbaresi
Jens Pohlmann
A Reproducible IT-Blog Corpus
Journal of Open Humanities Data
web blogs
corpus linguistics
internet policy
discourse analysis
public discussion
freedom of expression
author_facet Adrien Barbaresi
Jens Pohlmann
author_sort Adrien Barbaresi
title A Reproducible IT-Blog Corpus
title_short A Reproducible IT-Blog Corpus
title_full A Reproducible IT-Blog Corpus
title_fullStr A Reproducible IT-Blog Corpus
title_full_unstemmed A Reproducible IT-Blog Corpus
title_sort reproducible it-blog corpus
publisher Ubiquity Press
series Journal of Open Humanities Data
issn 2059-481X
publishDate 2021-07-01
description The dataset comprises text and metadata extracted from several hundred IT-blogs and websites, along with a method to duplicate the data by updating its contents and downloading it to the user’s local machine. The targets have been hand-picked with the intention to represent the discourse on blogs and websites dedicated to questions at the intersection of technology and society from Germany and the United States of America. The texts have been retrieved by web crawling techniques. The resulting corpus is accessible through a search platform and also reproducible with freely accessible descriptors and software.
topic web blogs
corpus linguistics
internet policy
discourse analysis
public discussion
freedom of expression
url https://openhumanitiesdata.metajnl.com/articles/35
work_keys_str_mv AT adrienbarbaresi areproducibleitblogcorpus
AT jenspohlmann areproducibleitblogcorpus
AT adrienbarbaresi reproducibleitblogcorpus
AT jenspohlmann reproducibleitblogcorpus
_version_ 1721211548820570112