Probabilistic Simhash Matching

Finding near-duplicate documents is an interesting problem, but existing methods are not suitable for large-scale datasets and memory-constrained systems. In this work, we developed approaches that tackle the problem of finding near-duplicates while improving query performance and using less memo...

Bibliographic Details
Main Author: Sood, Sadhan
Other Authors: Loguinov, Dmitri
Format: Others
Language: en_US
Published: 2012
Online Access: http://hdl.handle.net/1969.1/ETD-TAMU-2011-08-9813