EFFICIENT METHODS FOR INFORMATION EXTRACTION IN NEWS WEBPAGES

PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO === CONSELHO NACIONAL DE DESENVOLVIMENTO CIENTÍFICO E TECNOLÓGICO === We tackle the task of news webpage segmentation, specifically identifying the news title, publication date, and story body. While there are very...

Bibliographic Details
Main Author:	EDUARDO TEIXEIRA CARDOSO
Other Authors:	EDUARDO SANY LABER
Language:	English
Published:	PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO 2011
Online Access:	http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=28984@1 http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=28984@2

id	ndltd-IBICT-oai-MAXWELL.puc-rio.br-28984
record_format	oai_dc
spelling	ndltd-IBICT-oai-MAXWELL.puc-rio.br-28984 === 2019-03-01T15:42:49Z === EFFICIENT METHODS FOR INFORMATION EXTRACTION IN NEWS WEBPAGES === MÉTODOS EFICIENTES PARA EXTRAÇÃO DE INFORMAÇÃO EM PÁGINAS DE NOTÍCIAS === EDUARDO TEIXEIRA CARDOSO === EDUARDO SANY LABER === MARCO ANTONIO CASANOVA === RAUL PIERRE RENTERIA === PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO === CONSELHO NACIONAL DE DESENVOLVIMENTO CIENTÍFICO E TECNOLÓGICO === 2011-08-24 === info:eu-repo/semantics/publishedVersion === info:eu-repo/semantics/masterThesis === http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=28984@1 http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=28984@2 === eng === info:eu-repo/semantics/openAccess === PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO PPG EM INFORMÁTICA PUC-Rio BR === reponame:Repositório Institucional da PUC_RIO instname:Pontifícia Universidade Católica do Rio de Janeiro instacron:PUC_RIO
collection	NDLTD
language	English
sources	NDLTD
description	PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO === CONSELHO NACIONAL DE DESENVOLVIMENTO CIENTÍFICO E TECNOLÓGICO === We tackle the task of news webpage segmentation, specifically identifying the news title, publication date, and story body. While the literature reports very good results, most of them rely on webpage rendering, which is a very time-consuming step. We focus on scenarios with a high volume of documents, where a short execution time is a must. The chosen approach extends our previous work in the area, combining structural properties with hints of visual presentation styles, computed with a method faster than regular rendering, and machine learning algorithms. In our experiments, we paid special attention to aspects that are often overlooked in the literature, such as processing time and the generalization of the extraction results to unseen domains. Our approach proved to be about an order of magnitude faster than an equivalent full-rendering alternative while retaining good extraction quality.
author2	EDUARDO SANY LABER
author_facet	EDUARDO SANY LABER EDUARDO TEIXEIRA CARDOSO
author	EDUARDO TEIXEIRA CARDOSO
title	EFFICIENT METHODS FOR INFORMATION EXTRACTION IN NEWS WEBPAGES
publisher	PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO
publishDate	2011
url	http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=28984@1 http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=28984@2

Similar Items