Redblock: a tool for online deduplication on large datasets

Bibliographic Details
Main Authors: Luan Félix Pimentel, Igor Lemos Vicente, Guilherme Dal Bianco
Format: Article
Language: English
Published: Universidade de Passo Fundo (UPF) 2017-07-01
Series: Revista Brasileira de Computação Aplicada
Online Access: http://seer.upf.br/index.php/rbca/article/view/7143
Description
Summary: Online data deduplication aims to identify records that represent the same real-world entity in a continuous data stream. It must process a wide range of information with high effectiveness and without delay. This paper introduces Redblock, a tool for real-time data deduplication that combines a distributed platform for online processing with an inverted index. In the experimental evaluation, Redblock achieved good preliminary results in terms of efficiency and effectiveness on a database.
ISSN: 2176-6649
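
Illustration: The summary describes combining online processing with an inverted index. The following is a minimal single-process sketch of that general idea, not Redblock's actual implementation, which this record does not detail. The class name OnlineDeduplicator, the whitespace tokenizer, the Jaccard similarity measure, and the threshold value are all assumptions chosen for illustration, and the distributed platform is left out entirely.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace into a set of tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


class OnlineDeduplicator:
    """Hypothetical sketch: checks each arriving record against earlier ones.

    The inverted index maps every token to the ids of the records that
    contain it, so a new record is compared only with records sharing
    at least one token instead of with the whole history.
    """

    def __init__(self, threshold=0.6):
        self.threshold = threshold      # assumed similarity cut-off
        self.index = defaultdict(set)   # token -> ids of records containing it
        self.records = {}               # record id -> token set

    def process(self, record_id, text):
        """Return ids of earlier records judged duplicates, then index the record."""
        tokens = tokenize(text)

        # Candidate generation: look up each token in the inverted index.
        candidates = set()
        for token in tokens:
            candidates |= self.index.get(token, set())

        # Verification: keep only candidates that are similar enough.
        matches = sorted(
            cid for cid in candidates
            if jaccard(tokens, self.records[cid]) >= self.threshold
        )

        # Index the new record so that later arrivals can match against it.
        self.records[record_id] = tokens
        for token in tokens:
            self.index[token].add(record_id)
        return matches
```

In use, records would be passed to process one at a time as they arrive; each call returns the ids of previously seen records that meet the threshold, so the index grows incrementally rather than being rebuilt in batch.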