algorithm - How to handle different URLs that lead to the same page in a web robot application


I have a problem with my web robot application.

URL A: http://www.domain.com/path?id=1

URL B: http://www.domain.com/path?id=1&sessionid=xxxxxx

These two URLs lead to the same page, so the robot application downloads the page twice.

In the robot application, each URL is converted to an MD5 value, which is used to check whether the URL has already been visited. When the URL string changes, the MD5 value changes too, so the visited cache is not hit.
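To make the cache miss concrete, here is a minimal sketch of the visited check described above. The names `url_key`, `visited`, and `should_download` are hypothetical, chosen only for illustration; the real robot may store its keys differently.

```python
import hashlib

# Hypothetical in-memory visited cache keyed by the MD5 of the URL string.
visited = set()

def url_key(url: str) -> str:
    """Return the MD5 hex digest of the raw URL string."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def should_download(url: str) -> bool:
    """Return True if the URL's key is not in the cache yet, and record it."""
    key = url_key(url)
    if key in visited:
        return False
    visited.add(key)
    return True

url_a = "http://www.domain.com/path?id=1"
url_b = "http://www.domain.com/path?id=1&sessionid=xxxxxx"

first = should_download(url_a)   # first time this string is seen
second = should_download(url_b)  # different string, so a different MD5 key
```

Because the key depends only on the URL string, the extra `sessionid` parameter produces a different digest and the second URL is treated as unvisited.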

Is there a better solution?

If I were you, I'd use an algorithm to calculate the similarity of the content and, if two pages are similar above a configured threshold, consider them the same document. Checking for absolute equality (like an md5sum of the content) won't work, because dynamic content (like a timestamp) would break such a scheme.

Using document similarity is a common approach in web crawling to prevent robots from downloading the same content over and over.

A simple similarity algorithm like Levenshtein distance would do the job, but cosine similarity is better suited for this.
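As a rough illustration of the cosine similarity idea, here is a minimal sketch that compares pages by their word counts. The tokenizer, the function names, and the 0.9 threshold are all assumptions made for this example; a real crawler would tune the threshold and probably use a more careful text extraction step.

```python
import math
import re
from collections import Counter

# Assumed threshold; tune it against real pages.
SIMILARITY_THRESHOLD = 0.9

def term_counts(text: str) -> Counter:
    """Count lowercase word tokens in the text (a deliberately naive tokenizer)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the term-count vectors of two texts."""
    va, vb = term_counts(a), term_counts(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

def is_duplicate(new_doc: str, seen_docs, threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """True if new_doc is at least `threshold` similar to any already seen document."""
    return any(cosine_similarity(new_doc, old) >= threshold for old in seen_docs)

# Two renderings of the same page that differ only in a timestamp.
page_a = "Product 1 details price 10 USD generated at 12:00:01"
page_b = "Product 1 details price 10 USD generated at 12:00:05"
same_page = is_duplicate(page_b, [page_a])
```

Note that this compares the new page against every stored page, which gets expensive as the crawl grows, and that the page still has to be fetched once before it can be compared.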

