Identifying Equivalent URLs Using URL Signatures

Citation

Soon, Lay-Ki and Lee, Sang Ho (2008) Identifying Equivalent URLs Using URL Signatures. In: IEEE International Conference on Signal Image Technology and Internet Based Systems, 2008. SITIS '08. IEEE, pp. 203-210. ISBN 978-0-7695-3493-0

Abstract

In the standard URL normalization mechanism, URLs are normalized syntactically by a set of predefined steps. In this paper, we propose to enhance the standard URL normalization by incorporating the semantically meaningful metadata of the Web pages. The metadata taken into account are the body texts of the Web pages, which can be extracted during HTML parsing. Given a URL which has undergone the standard normalization mechanism, we construct its URL signature by hashing or fingerprinting the body text of the associated Web page using Message-Digest algorithm 5. URLs which share identical signatures are considered to be equivalent in our scheme. The experimental results show that our proposed method helps to further reduce redundant Web information retrieval by 34.57% in comparison with the standard URL normalization mechanism.
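
The signature scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes the URLs have already gone through standard normalization and that the body text has already been extracted by an HTML parser. The function names `url_signature` and `group_equivalent_urls`, the UTF-8 encoding choice, and the sample URLs are all hypothetical.

```python
import hashlib
from collections import defaultdict


def url_signature(body_text: str) -> str:
    """Return the MD5 hex digest of a page's body text (assumed UTF-8)."""
    return hashlib.md5(body_text.encode("utf-8")).hexdigest()


def group_equivalent_urls(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose body texts produce identical signatures."""
    groups = defaultdict(list)
    for url, body_text in pages.items():
        groups[url_signature(body_text)].append(url)
    return list(groups.values())


# Hypothetical, already-normalized URLs mapped to their extracted body text.
pages = {
    "http://example.com/": "Welcome to the example site.",
    "http://example.com/index.html": "Welcome to the example site.",
    "http://example.com/about": "About this site.",
}
equivalent = group_equivalent_urls(pages)
```

Here the first two URLs fall into one group because their body texts hash to the same signature, so a crawler following this idea would need to retrieve only one of them. Because the signature is an exact hash, any difference in the body text, even a single character, yields a different signature.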

Item Type: Book Section
Subjects: T Technology > T Technology (General)
Divisions: Faculty of Information Science and Technology (FIST)
Depositing User: Ms Suzilawati Abu Samah
Date Deposited: 14 Nov 2013 06:51
Last Modified: 14 Nov 2013 06:51
URI: http://shdl.mmu.edu.my/id/eprint/4408
