The web spider algorithm: very simple and fun


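Subject: The web spider algorithm: very simple and fun
Original poster | sog | 2005-02-06 12:56:36
The famous web spider (crawler) algorithm is very simple and fun. You would not expect that this is all it takes to collect the whole internet.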

Basic Crawling Algorithm

Simple-Crawler(S0, D, E)
///////////////////////////////////////////
//sog's note: this is just a breadth-first traversal of a graph.
//S0 is the seed, such as the sina front page or another web page that has many links
//D is the document set: the html body content of each fetched page.
//E is the link set: the html hyperlink information (the edges of the graph).
///////////////////////////////////////////

Q <- S0 //construct the link queue from the seed
While Q ≠ ∅ //while the queue is not empty
Do u <- dequeue(Q) //get the next link from the queue
   d(u) <- fetch(u) //get the content resource through the link
   store(D,(d(u),u)) //store the content info
   L <- parse(d(u)) //parse new links from the html content page
   For each v in L //loop over every element of the link set
   Do store(E,(u,v)) //store the link info
        If ¬(v∈D ∨ v∈Q) //if v is not in the content set and not in the link queue
             Then enqueue(Q,v) //insert the new element into the link queue
End //end while loop
////////////////////////////////end of algorithm
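
For anyone who wants to try it, here is a minimal Python sketch of the pseudocode above. It is only an illustration under some assumptions of mine: the names simple_crawler, fetch, parse and TOY_WEB are hypothetical and not part of the original algorithm, and fetch/parse are passed in as plain functions so the loop can run on a toy in-memory "web" without touching the network.

```python
from collections import deque

def simple_crawler(seeds, fetch, parse):
    """Breadth-first crawl starting from the seed links.

    fetch(u) returns the content for link u (assumed helper).
    parse(content) returns the links found in that content (assumed helper).
    Returns (D, E): D maps link -> content, E is a list of (u, v) edges.
    """
    D = {}                              # document set: link -> content
    E = []                              # link set: edges (u, v) of the graph
    Q = deque(dict.fromkeys(seeds))     # FIFO link queue, duplicate seeds dropped
    queued = set(Q)                     # mirrors Q so "v in Q" is a fast lookup
    while Q:                            # while the queue is not empty
        u = Q.popleft()                 # dequeue the oldest link
        queued.discard(u)
        D[u] = fetch(u)                 # fetch and store the content
        for v in parse(D[u]):           # every link found on the page
            E.append((u, v))            # store the edge
            if v not in D and v not in queued:
                Q.append(v)             # enqueue links not seen before
                queued.add(v)
    return D, E

# Toy "web": each page's content is simply the list of links it contains.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}

D, E = simple_crawler(
    ["a"],
    fetch=lambda u: TOY_WEB.get(u, []),  # stand-in for an HTTP request
    parse=lambda content: content,       # stand-in for HTML link extraction
)
print(list(D))  # pages in the order they were fetched
print(E)        # every (source, target) link that was seen
```

A real crawler would replace fetch with an HTTP request and parse with an HTML link extractor, and would also need things the pseudocode leaves out, such as URL normalization, politeness delays and error handling.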

Search engine fans are welcome to exchange ideas with me!
--
FROM 211.167.199.*
