A Reproducible IT-Blog Corpus
The dataset comprises text and metadata extracted from several hundred IT-blogs and websites, along with a method to duplicate the data by updating its contents and downloading it to the user’s local machine. The targets have been hand-picked with the intention to represent the discourse on blogs an...
Main Authors: | Adrien Barbaresi, Jens Pohlmann |
---|---|
Format: | Article |
Language: | English |
Published: | Ubiquity Press 2021-07-01 |
Series: | Journal of Open Humanities Data |
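Subjects: | web blogs, corpus linguistics, internet policy, discourse analysis, public discussion, freedom of expression |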
Online Access: | https://openhumanitiesdata.metajnl.com/articles/35 |
id |
doaj-f0183d32d0ac4f04adbb6b78d2982a02 |
---|---|
record_format |
Article |
spelling |
doaj-f0183d32d0ac4f04adbb6b78d2982a02 2021-08-11T08:05:52Z eng Ubiquity Press Journal of Open Humanities Data 2059-481X 2021-07-01 710.5334/johd.3536 A Reproducible IT-Blog Corpus Adrien Barbaresi 0 Jens Pohlmann 1 Center for Digital Lexicography of German, BBAW, Berlin Centre for Media, Communication & Information Research (ZeMKI), University of Bremen, Bremen, DE; Center for Spatial and Textual Analysis (CESTA), Stanford University, Stanford The dataset comprises text and metadata extracted from several hundred IT-blogs and websites, along with a method to duplicate the data by updating its contents and downloading it to the user’s local machine. The targets have been hand-picked with the intention to represent the discourse on blogs and websites dedicated to questions at the intersection of technology and society from Germany and the United States of America. The texts have been retrieved by web crawling techniques. The resulting corpus is accessible through a search platform and also reproducible with freely accessible descriptors and software. https://openhumanitiesdata.metajnl.com/articles/35 web blogs corpus linguistics internet policy discourse analysis public discussion freedom of expression |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Adrien Barbaresi Jens Pohlmann |
title |
A Reproducible IT-Blog Corpus |
publisher |
Ubiquity Press |
series |
Journal of Open Humanities Data |
issn |
2059-481X |
publishDate |
2021-07-01 |
description |
The dataset comprises text and metadata extracted from several hundred IT-blogs and websites, along with a method to duplicate the data by updating its contents and downloading it to the user’s local machine. The targets have been hand-picked with the intention to represent the discourse on blogs and websites dedicated to questions at the intersection of technology and society from Germany and the United States of America. The texts have been retrieved by web crawling techniques. The resulting corpus is accessible through a search platform and also reproducible with freely accessible descriptors and software. |
topic |
web blogs corpus linguistics internet policy discourse analysis public discussion freedom of expression |
url |
https://openhumanitiesdata.metajnl.com/articles/35 |