algorithm - How to handle different URLs that lead to the same page in a web robot application


I have a problem with a web robot application.

URL A: http://www.domain.com/path?id=1

URL B: http://www.domain.com/path?id=1&sessionid=xxxxxx

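These two URLs lead to the same page, so the robot application downloads the page twice.

In the robot application, each URL is converted to an MD5 value, which is used to check whether the URL has already been visited. When the URL string changes, the MD5 value changes too, so the visited cache is not hit.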
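To make the problem concrete, here is a minimal sketch of a visited check like the one described above. The names `url_key`, `visited`, and `should_download` are hypothetical and only illustrate the idea; the real robot may store its cache differently.

```python
import hashlib

def url_key(url: str) -> str:
    # MD5 of the raw URL string, used as the cache key.
    return hashlib.md5(url.encode("utf-8")).hexdigest()

visited = set()  # hypothetical in-memory visited cache

def should_download(url: str) -> bool:
    # Return True the first time a URL string is seen, False afterwards.
    key = url_key(url)
    if key in visited:
        return False
    visited.add(key)
    return True

url_a = "http://www.domain.com/path?id=1"
url_b = "http://www.domain.com/path?id=1&sessionid=xxxxxx"

first = should_download(url_a)   # new key, page is downloaded
second = should_download(url_b)  # different string, so a different key: downloaded again
repeat = should_download(url_a)  # exact same string: cache hit
```

Only an identical URL string hits the cache, so URL B is treated as a new page even though it serves the same content as URL A.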

Is there a better solution?

I'd use an algorithm to calculate the similarity of the content and, if the similarity is above a configured threshold, consider the two pages the same document. Checking for absolute equality (like md5sum) won't work, because dynamic content (like a timestamp) breaks such a scheme.

Using document similarity is a common approach in web crawling to prevent robots from downloading the same content over and over.

A simple similarity algorithm like Levenshtein distance might do the job, but cosine similarity is better suited for this.
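As a rough illustration of the cosine similarity approach, the sketch below compares two pages by their word counts. The helper names, the sample page texts, and the threshold of 0.8 are assumptions made for the example, not values from the question; a real crawler would tune the threshold and probably strip HTML markup first.

```python
import math
import re
from collections import Counter

def term_vector(text: str) -> Counter:
    # Bag-of-words vector: lowercase word -> number of occurrences.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    # Dot product over shared terms divided by the product of vector lengths.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

SIMILARITY_THRESHOLD = 0.8  # assumed value, to be tuned per site

def is_same_document(text_a: str, text_b: str,
                     threshold: float = SIMILARITY_THRESHOLD) -> bool:
    score = cosine_similarity(term_vector(text_a), term_vector(text_b))
    return score >= threshold

# Hypothetical page bodies: the first two differ only in a timestamp.
page_a = "Product 1 details: a blue widget. Generated at 10:15:02"
page_b = "Product 1 details: a blue widget. Generated at 10:15:07"
page_c = "Contact us by email or phone for support"

same_ab = is_same_document(page_a, page_b)  # near-duplicates
same_ac = is_same_document(page_a, page_c)  # unrelated pages
```

Note that this check needs the page content, so it detects the duplicate after the second download rather than before it; the robot can then skip storing or processing the duplicate.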

