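The famous web spider algorithm is very simple and fun. You would not guess that this is all it takes to have the whole internet in your hands, would you?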
Basic Crawling Algorithm
Simple-Crawler(S0,D,E)
///////////////////////////////////////////
//sog's note: this is just a breadth-first traversal of a graph.
//S0 is the seed set, such as the sina front page or any other web page with many links.
//D is the document set: the HTML body content of each fetched page.
//E is the link set: the HTML hyperlink information (the edges of the graph).
///////////////////////////////////////////
Q <- S0 //initialize the link queue with the seed
While Q ≠ ∅ //while the queue is not empty
Do u <- dequeue(Q) //take the next link from the queue
   d(u) <- fetch(u) //fetch the content behind the link
   Store(D,(d(u),u)) //store the content
   L <- parse(d(u)) //extract the links from the HTML page
   For each v in L //loop over every extracted link
   Do Store(E,(u,v)) //store the link (edge) info
      If ¬(v∈D ∨ v∈Q) //if v has been neither fetched nor queued
      Then enqueue(Q,v) //add the new link to the queue
   end //end for loop
end //end while loop
////////////////////////////////end of algorithm
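For anyone who wants to try it, here is a rough Python sketch of the pseudocode above. It is only an illustration, not a real crawler: the names `simple_crawler`, `TOY_WEB`, `toy_fetch` and `toy_parse` are made up for this example, the "web" is a small in-memory dict instead of real HTTP, and D is kept as a url -> content dict rather than a set of (d(u),u) pairs. It also assumes one `queued` set can stand in for the "v∈D ∨ v∈Q" test, since every URL in D or Q has been enqueued at some point.

```python
from collections import deque


def simple_crawler(seeds, fetch, parse):
    """Breadth-first crawl starting from seeds; returns (D, E)."""
    D = {}                              # document set: url -> content
    E = []                              # link set: list of (u, v) edges
    Q = deque(dict.fromkeys(seeds))     # link queue, duplicate seeds dropped
    queued = set(Q)                     # every URL ever enqueued (in D or in Q)
    while Q:                            # while the queue is not empty
        u = Q.popleft()                 # take the next link
        d_u = fetch(u)                  # fetch the content behind it
        D[u] = d_u                      # store the content
        for v in parse(d_u):            # loop over every extracted link
            E.append((u, v))            # store the edge
            if v not in queued:         # neither fetched nor queued yet
                queued.add(v)
                Q.append(v)
    return D, E


# Hypothetical toy "web": each page's content is just its outgoing links.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "a"],
    "c": ["d"],
    "d": [],
}


def toy_fetch(url):
    # Stand-in for an HTTP GET: returns the links as a space-separated string.
    return " ".join(TOY_WEB.get(url, []))


def toy_parse(content):
    # Stand-in for HTML link extraction.
    return content.split()


if __name__ == "__main__":
    D, E = simple_crawler(["a"], toy_fetch, toy_parse)
    print(list(D))
    print(E)
```

A real crawler would replace `toy_fetch` with an HTTP request and `toy_parse` with an HTML link extractor, and would also need politeness delays, robots.txt handling and URL normalization, none of which the pseudocode covers.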
Search engine fans are welcome to exchange ideas with me!
--
FROM 211.167.199.*