Online Analysis of High Volume Social Text Streams

Social media is one of the most disruptive developments of the past decade, and this information revolution has had a fundamental impact on our society. Information dissemination has never been cheaper, and users are increasingly connected with each other. The line between content producers and co...

Bibliographic Details
Main Author: Bansal, Nilesh
Other Authors: Koudas, Nick
Language: en_ca
Published: 2013
Subjects: social media; analytics; algorithms
Online Access: http://hdl.handle.net/1807/43485
id ndltd-TORONTO-oai-tspace.library.utoronto.ca-1807-43485
record_format oai_dc
spelling ndltd-TORONTO-oai-tspace.library.utoronto.ca-1807-43485
2014-01-08T04:09:14Z
Online Analysis of High Volume Social Text Streams
Bansal, Nilesh
social media
analytics
algorithms
0984
Koudas, Nick
2013-11
2014-01-07T18:56:55Z
NO_RESTRICTION
2014-01-07T18:56:55Z
2014-01-07
Thesis
http://hdl.handle.net/1807/43485
en_ca
collection NDLTD
language en_ca
sources NDLTD
topic social media
analytics
algorithms
0984
spellingShingle social media
analytics
algorithms
0984
Bansal, Nilesh
Online Analysis of High Volume Social Text Streams
description Social media is one of the most disruptive developments of the past decade, and this information revolution has had a fundamental impact on our society. Information dissemination has never been cheaper, and users are increasingly connected with each other. The line between content producers and consumers is blurred, leaving us with an abundance of data produced in real time by users around the world on a multitude of topics. In this thesis we study techniques to aid an analyst in uncovering insights from this new media form, which is modeled as a high-volume social text stream. The aim is to develop practical algorithms with a focus on the ability to scale, amenability to reliable operation, usability, and ease of implementation. Our work lies at the intersection of building large-scale real-world systems and developing the theoretical foundation to support them. We identify three key predicates to enable online methods for the analysis of social data: Persistent Chatter Discovery, to explore topics discussed over a period of time; Cross-referencing Media Sources, to initiate analysis using a document as the query; and Contributor Understanding, to create aggregate expertise and topic summaries of authors contributing online. The thesis defines each predicate in detail and covers the proposed techniques, their practical applicability, and detailed experimental results that establish accuracy and scalability for each of the three. We present BlogScope, the core data aggregation and management platform, developed as part of the thesis to enable implementation of the key predicates in a real-world setting. The system provides a web-based user interface for searching social media conversations and analyzing the results in a multitude of ways. BlogScope and its modified versions index tens to hundreds of billions of text documents while providing interactive query times. Specifically, BlogScope has been crawling 50 million active blogs with 3.25 billion blog posts. The same techniques have also been successfully tested on a Twitter data stream that adds thousands of new Tweets every second and has archived over 30 billion documents. The social graph part of our database consists of 26 million Twitter user nodes with 17 billion follower edges. The BlogScope system has been used by over 10,000 unique visitors a day, and the commercial version of the system is used by thousands of enterprise clients globally. As social media continues to evolve at an exponential pace, much remains to be studied. The thesis concludes by outlining some possible future research directions.
author2 Koudas, Nick
author_facet Koudas, Nick
Bansal, Nilesh
author Bansal, Nilesh
author_sort Bansal, Nilesh
title Online Analysis of High Volume Social Text Streams
title_short Online Analysis of High Volume Social Text Streams
title_full Online Analysis of High Volume Social Text Streams
title_fullStr Online Analysis of High Volume Social Text Streams
title_full_unstemmed Online Analysis of High Volume Social Text Streams
title_sort online analysis of high volume social text streams
publishDate 2013
url http://hdl.handle.net/1807/43485
work_keys_str_mv AT bansalnilesh onlineanalysisofhighvolumesocialtextstreams
_version_ 1716623029715337216