Probabilistic Simhash Matching

Finding near-duplicate documents is an interesting problem, but existing methods are not suitable for large-scale datasets and memory-constrained systems. In this work, we developed approaches that tackle the problem of finding near-duplicates while improving query performance and using less memo...

Bibliographic Details
Main Author:	Sood, Sadhan
Other Authors:	Loguinov, Dmitri
Format:	Others
Language:	en_US
Published:	2012
Subjects:	Hamming distance; near-duplicate; similarity search; fingerprint; web crawl; clustering; web document
Online Access:	http://hdl.handle.net/1969.1/ETD-TAMU-2011-08-9813