In this article I will teach you how to write a spider. According to Wikipedia, this is the definition of a spider:
A Web crawler (also known as a Web spider, Web robot, or—especially in the FOAF community—Web scutter) is a program or automated script that browses the World Wide Web in a methodical, automated manner
In layman’s terms, this means: a spider is an application that does something with every link on every page it finds. Say, for example, you want to check your website to see whether all your links are still good (no 404s). To accomplish this, we could use a spider and set a callback so that each time the spider finds a broken link, something (defined by us) happens: maybe writing a report, or storing the links in a database. Who knows? The decision is yours.
Steps to follow in writing your own crawler
- Requesting a page
The first step is deciding how you are going to interact with a page. In order to load a page, an HTTP connection must be made and the page requested. Here’s how a request would be done if it should happen that you don’t have a browser nearby:
telnet somesite.com 80
GET /index.html HTTP/1.0
press ENTER/RETURN twice
In the example above, I’m using telnet to connect to the web server. Most sites are served on port 80, but don’t be surprised if you find some that use other ports (like 8080). The second line of the example specifies the method we’re going to use to request the page (in this case GET), which page we want (/index.html, that is, http://somesite.com/index.html), and the HTTP version we want to use. After that line, press RETURN/ENTER twice, and the server will send you the page (along with the HTTP headers). It’s not really complicated to do, even from your own code, but it becomes pretty cumbersome once you need to work with cookies or use other HTTP methods. That’s why I suggest using a higher-level library that does all the heavy lifting for you. For this article, I will use WWW::Mechanize for Ruby.
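For the curious, here is roughly what the telnet session looks like when done from Ruby with nothing but the standard library. Treat it as a sketch: the method names (raw_get_request, fetch_raw, split_response) are made up for this example, and the Host header is something I added on top of the telnet session, since many servers that host several sites on one address need it.

```ruby
require "socket"

# Builds the same request text that was typed into telnet above.
# The Host header is an addition (it is not in the telnet example).
def raw_get_request(host, path = "/index.html")
  "GET #{path} HTTP/1.0\r\n" \
    "Host: #{host}\r\n" \
    "\r\n" # the empty line is what pressing ENTER the second time sends
end

# Opens a TCP connection, sends the request, and returns everything the
# server sends back (status line, headers, and body) as one string.
def fetch_raw(host, path = "/index.html", port = 80)
  TCPSocket.open(host, port) do |socket|
    socket.write(raw_get_request(host, path))
    socket.read # an HTTP/1.0 server closes the connection when it is done
  end
end

# Splits a raw response into its header block and its body.
def split_response(raw)
  headers, body = raw.split("\r\n\r\n", 2)
  [headers, body.to_s]
end
```

Calling fetch_raw("somesite.com") and passing the result to split_response gives you the headers and the page separately. Cookies, redirects, and HTTPS are all missing here, which is exactly the cumbersome part a higher-level library takes care of.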
- Parsing an HTML page
If we are to find the links in a page, we must parse it. Some people accomplish this task with regexes; smarter people use HTML parsers. You are free to use whichever parser you like. I am using Nokogiri, the parser Mechanize uses.
- The spidering per se
Once you know how to load a page and parse the links in it, the rest is just a matter of recursion. Like any recursive solution, it needs a stop condition, or else bad things happen (here, the spider would keep revisiting the same pages forever). Pseudo-code for a spider could look like this:
request page
for each link in page's links
    if (link matches our conditions and link wasn't visited before)
        do something with the link
        add the link to the already visited links
        go back to first step with link as page
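To see the recursion and its stop condition in isolation, here is a minimal sketch of the pseudo-code in Ruby. It does no networking at all: links_for is a hypothetical stand-in for "request page and parse its links", and the SITE hash is a made-up website with a cycle in it, so both are assumptions for illustration only.

```ruby
require "set"

# A tiny spider that follows the pseudo-code step by step.
# links_for is any callable that takes a URL and returns an array of URLs.
class TinySpider
  attr_reader :visited

  def initialize(links_for, &on_link)
    @links_for = links_for
    @on_link = on_link   # "do something with the link"
    @visited = Set.new   # the stop condition depends on this set
  end

  def crawl(page)
    @visited << page
    @links_for.call(page).each do |link|
      next unless wanted?(link) && !@visited.include?(link)

      @on_link.call(link)
      crawl(link) # marks the link as visited and repeats the steps for it
    end
  end

  private

  # "link matches our conditions": here, only links on the example site.
  def wanted?(link)
    link.start_with?("http://somesite.com/")
  end
end

# An in-memory "website" (made up for this example) with a cycle: / -> /a -> /
SITE = {
  "http://somesite.com/"  => ["http://somesite.com/a", "http://other.com/"],
  "http://somesite.com/a" => ["http://somesite.com/", "http://somesite.com/b"],
  "http://somesite.com/b" => []
}.freeze

seen = []
spider = TinySpider.new(->(url) { SITE.fetch(url, []) }) { |link| seen << link }
spider.crawl("http://somesite.com/")
puts seen
```

Without the visited check, the cycle between / and /a would recurse until the stack overflows. On a very large site, even a correct recursive spider can go deep, so an explicit queue of pending links is a reasonable alternative to recursion.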
Those would be the steps of writing a very simple crawler. I’ll show you a crawler I wrote in Ruby:
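require "rubygems"
require "mechanize"

# Note: Mechanize 2.0 and later dropped the WWW namespace; there, use
# Mechanize and Mechanize::Page instead of the WWW:: names below.
class Crawler < WWW::Mechanize
  attr_accessor :callback

  # Values the callback may return for a link.
  INDEX = 0
  DOWNLOAD = 1
  PASS = 2

  def initialize
    super
    init
    @first = true
    self.user_agent_alias = "Windows IE 6"
  end

  # Resets the list of visited links.
  def init
    @visited = []
  end

  # Marks a link as visited without fetching it.
  def remember(link)
    @visited << link
  end

  # Fetches a page and starts the crawl on every link not visited yet.
  def perform_index(link)
    page = get(link)
    # get returns nil on errors, and non-HTML responses are not Page objects
    return unless page.is_a?(WWW::Mechanize::Page)

    links = page.links.map { |l| l.href } - @visited
    links.each { |alink| start(alink) }
  end

  def start(link)
    return if link.nil?
    return if @visited.include?(link)

    action = @callback.call(link)
    if @first
      # The starting page is always indexed (once), whatever the callback says.
      @first = false
      action = INDEX
    end

    case action
    when INDEX
      perform_index(link)
    when DOWNLOAD
      file = get(link)
      file.save_as(File.basename(link)) if file
    when PASS
      puts "passing on #{link}"
    end
  end

  # Wraps Mechanize's get: logs the request, records the link as visited,
  # and returns nil instead of raising when the request fails.
  def get(site)
    puts "getting #{site}"
    @visited << site
    super(site)
  rescue StandardError
    puts "error getting #{site}"
    nil
  end
end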
Would you believe pasting the code and formatting it took more than 30 minutes? There must be a better way of pasting code. The code is pretty self-explanatory. I’m defining the crawler as a subclass of WWW::Mechanize, and I’m setting its default user-agent to that of Internet Explorer 6. After creating the spider, the coder must supply a callback (in the form of a lambda/block). The block takes a link as an argument and must return an int (INDEX, DOWNLOAD, or PASS). This is how the crawler will react to those ints:
- if the block returns INDEX, the spider will process that link and all the links found on that page (an indexing process)
- if the block returns DOWNLOAD, the spider will download that link
- if the block returns PASS, the spider will ignore the link
Here is a usage sample:
require "./crawler" # assumes the class above is saved as crawler.rb in the current directory

x = Crawler.new
callback = lambda do |link|
  if link =~ /\.(zip|rar|gz|pdf|doc)/
    # skip archives and documents, but record them as visited
    x.remember(link)
    return Crawler::PASS
  elsif link =~ /\.(jpg|jpeg)/
    return Crawler::DOWNLOAD
  end
  return Crawler::INDEX
end
x.callback = callback
x.start("http://somesite.com")
The script above will ignore zip, rar, and gz archives, as well as pdf and Word documents, and will download jpg files. The rest of the links will be indexed. The way the spider is written now, it only adds to the visited links the links on which it performed a get, so, in order to speed things up, we add the ignored links to the visited list too. This is done with the remember method. If you are curious about the speed improvement the remember call brings, test it on a website: try the callback with it, and then without it.
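One thing to keep in mind about the visited list: the crawler compares links as raw strings, so "/a.html" and "http://somesite.com/a.html" count as two different pages. Below is a sketch of one way to deal with that, using only the standard library. The VisitedLinks class and its method signatures are hypothetical (they are not part of the crawler above), and a Set is swapped in for the Array so that lookups stay fast as the list grows.

```ruby
require "set"
require "uri"

# Keeps track of visited links as absolute URLs, so that "/a.html" and
# "http://somesite.com/a.html" count as the same page.
class VisitedLinks
  def initialize
    @seen = Set.new
  end

  # Resolves href against the page it was found on and drops the #fragment.
  # Returns nil for anything that is not a usable http(s) link.
  def normalize(base, href)
    return nil if href.nil? || href.strip.empty?

    uri = URI.join(base, href.strip)
    return nil unless uri.is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP

    uri.fragment = nil
    uri.to_s
  rescue URI::Error
    nil
  end

  # Returns true the first time a link is seen, false afterwards
  # (and false for links that cannot be normalized).
  def remember(base, href)
    url = normalize(base, href)
    return false if url.nil?

    !@seen.add?(url).nil?
  end
end
```

With something like this in place, the spider could call remember(current_page_url, href) for every link and only follow the ones for which it returns true.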
Overall, this is a pretty basic spider, which can be extended and improved in a lot of ways. I hope you found this article useful. Catch you around!
Geo
There are a lot of ready made crawlers online that you can use to check for broken links, etc. But of course that takes all the fun out of it since you don’t get the chance to code it yourself.
I use the WordPress plugin Dean’s Code Highlighter. It supports most languages and makes code formatting a lot easier. It does syntax highlighting as a bonus. You should check it out.
I’m not really sure if as a free user I could add plugins to WordPress. I looked around in the administration panel but I didn’t find anything related to plugins.
Oh right. Pretty sure you would need to buy hosting if you want to add plugins.
You can also use Google Webmaster Tools to check for broken links.
Yes, but checking for broken links is just a small task sometimes. Most of the time you would have to do a lot more work.
Good!
Very valuable information! Recommended!
Very interesting! Do you know of a follow-up article that would talk about dealing with cookies as you mentioned? Thank you.
I posted many articles related to web scraping. You can find them under the scraping category in the blog. Working with cookies is covered.
I have just started looking at Ruby as a means of parsing starred RSS links from Google Reader and have been able to get a list of links from all my starred items in Google Reader by doing this:
require "rubygems"
require "open-uri"
require "simple-rss"
feed = "http://www.google.com/reader/public/atom/user/xxxxx/state/com.google/starred?n=400"
rss = SimpleRSS.parse open(feed)
rss.entries.each do |item|
  puts "#{item.title}\n"
end
where xxxxx is my 20 digit account number.
At the moment I run it as ruby mystarredlinks.rb > links.html, which gives me a list of links. This is nice but not exactly what I am looking for.
What I want to do now is follow each of those links and extract from the resulting page the links to SPECIFIC sites (i.e. if there is a link to Site1, index that link and move on to the next link from the original Google Reader list; if not, check whether there is a link to Site2, and so on until a valid link is found). The output of those valid links (one per page) is to be stored in an HTML file.
Any ideas where to start?
Sure:
- use an HTML parser to parse your resulting file
- use mechanize to follow each link
- use nokogiri/hpricot to extract the information you need
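Once the links of a page have been extracted, the "Site1 first, otherwise Site2" rule from the question could be sketched like this. The host names in PREFERRED_HOSTS and the pick_link and host_of methods are made up for the example, so adapt them to the real sites.

```ruby
require "uri"

# Site1 and Site2 from the question, as hypothetical host names in priority order.
PREFERRED_HOSTS = ["site1.example", "site2.example"].freeze

# Returns the first link pointing at the highest-priority host that appears
# among the page's links, or nil when none of the hosts is linked.
def pick_link(hrefs, hosts = PREFERRED_HOSTS)
  hosts.each do |host|
    match = hrefs.find { |href| host_of(href) == host }
    return match if match
  end
  nil
end

# Host part of a URL, or nil if the string cannot be parsed as a URL.
def host_of(href)
  URI.parse(href.to_s).host
rescue URI::InvalidURIError
  nil
end
```

Calling pick_link once per page from the Google Reader list gives at most one valid link per page, which can then be written to the HTML file.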