13 Feb 09

how to write a spider


In this article I will teach you how to write a spider. According to Wikipedia, this is the definition of a spider:

A Web crawler (also known as a Web spider, Web robot, or—especially in the FOAF community—Web scutter) is a program or automated script that browses the World Wide Web in a methodical, automated manner

In layman’s terms, this means that a spider is an application that does something with every link on every page it finds. Let’s say, for example, that you want to check your website to see if all your links are still good (no 404s). To accomplish this, we could use a spider and set a callback so that each time the spider finds a broken link, something (defined by us) happens: writing a report, perhaps, or storing the links in a database. The decision is yours.

Steps to follow in writing your own crawler

  1. Requesting a page
    The first step is deciding how you are going to interact with a page. In order to load a page, an HTTP connection must be made and the page must be requested. Here’s how a request would be done if you happened not to have a browser nearby:

    telnet somesite.com 80
    GET /index.html HTTP/1.0
    press ENTER/RETURN twice

    In the example above, I’m using telnet to connect to the web server. Most sites are served on port 80, but don’t be surprised if you find some that use other ports (like 8080). The second line specifies the method we’re going to use to request the page; in this case, GET. The /index.html part specifies which page we want (http://somesite.com/index.html), and HTTP/1.0 is the HTTP version we want to use. After that line, press RETURN/ENTER twice, and the server will send you the page (along with the HTTP headers). This isn’t complicated to do, even from your own code, but it becomes cumbersome once you need to work with cookies or use other HTTP methods. That’s why I suggest using a higher-level library that does the heavy lifting for you. For this article, I will use WWW::Mechanize for Ruby.
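The telnet session above can be sketched in plain Ruby with the standard socket library. This is only a minimal illustration, not what Mechanize does internally: the helper names build_request, split_response and fetch_raw are made up for this example, and the Host header is an addition that the telnet example does not send (many servers that host several sites on one address expect it).

```ruby
require "socket"

# Builds the request that was typed into telnet above.
# The Host header is an assumption added for this sketch.
def build_request(host, path = "/index.html")
  "GET #{path} HTTP/1.0\r\nHost: #{host}\r\n\r\n"
end

# Splits a raw response at the first blank line into
# an array of header lines and the body.
def split_response(raw)
  headers, body = raw.split("\r\n\r\n", 2)
  [headers.to_s.split("\r\n"), body.to_s]
end

# Opens a TCP connection, sends the request and reads until the
# server closes the connection (which is what HTTP/1.0 servers do).
def fetch_raw(host, path = "/index.html", port = 80)
  TCPSocket.open(host, port) do |socket|
    socket.write(build_request(host, path))
    split_response(socket.read)
  end
end
```

Calling fetch_raw("somesite.com") would return the header lines and the page body. Redirects, cookies and HTTPS are all left out, which is exactly the work a higher-level library takes off your hands.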

  2. Parsing an HTML page
    If we are to find the links in a page, we must parse it. Some people accomplish this task with regexes; smarter people use HTML parsers. You are free to use whichever parser you like. I am using nokogiri, the parser that mechanize uses.
  3. The Spidering per se
    Once you know how to load a page and parse the links in it, the rest is just a matter of recursion. Like any recursive solution, it needs a stop condition, or else bad things happen :) . The pseudo-code for a spider could look like this:

    request page
    for each link in page's links
      if(link matches our conditions and link wasn't visited before)
        do something with the link
        add the link to the already visited links
        go back to first step with link as page
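The pseudo-code above can be tried out without touching the network. The sketch below is an illustration under a few assumptions: SITE is a made-up in-memory table standing in for "request page and parse its links", and crawl, fetch_links and wanted are names invented for this example. The visited set is the stop condition: a link is followed only the first time it is seen. The start page is marked as visited up front so that it is not requested twice.

```ruby
require "set"

# Hypothetical stand-in for requesting and parsing pages:
# each URL maps to the links found on that page.
SITE = {
  "/index.html"   => ["/about.html", "/files/report.pdf", "/index.html"],
  "/about.html"   => ["/index.html", "/contact.html"],
  "/contact.html" => []
}.freeze

# Visits every reachable, wanted link exactly once (depth first)
# and returns the set of visited links.
def crawl(page, fetch_links, wanted, visited = Set[page], &on_link)
  fetch_links.call(page).each do |link|
    next unless wanted.call(link)   # link matches our conditions
    next unless visited.add?(link)  # add? returns nil if already visited
    on_link.call(link)              # do something with the link
    crawl(link, fetch_links, wanted, visited, &on_link)
  end
  visited
end

fetch_links = ->(url) { SITE.fetch(url, []) }
wanted = ->(link) { link.end_with?(".html") }

seen = []
crawl("/index.html", fetch_links, wanted) { |link| seen << link }
p seen # => ["/about.html", "/contact.html"]
```

On a real site the recursion can get deep, so a depth limit or an explicit queue may be worth adding.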

Those would be the steps of writing a very simple crawler. I’ll show you a crawler I wrote in Ruby:

require "rubygems"
require "mechanize"

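# Written for the Mechanize releases of the time, which used the WWW:: namespace.
# Newer releases name the class Mechanize (and the page class Mechanize::Page).
class Crawler < WWW::Mechanize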

  attr_accessor :callback
  INDEX = 0
  DOWNLOAD = 1
  PASS = 2

  def initialize
    super
    init
    @first = true
    self.user_agent_alias = "Windows IE 6"
  end

  def init
    @visited = []
  end

  def remember(link)
    @visited << link
  end

  def perform_index(link)
    page = get(link) # nil when the request failed
    return unless page.is_a?(WWW::Mechanize::Page)
    # Skip anchors without an href and links we have already seen.
    hrefs = page.links.map { |l| l.href }.compact - @visited
    hrefs.each { |href| start(href) }
  end

  def start(link)
    return if link.nil? || @visited.include?(link)
    action = @callback.call(link)
    if @first
      # Always index the start page (once), whatever the callback returned.
      @first = false
      action = INDEX
    end
    case action
    when INDEX
      perform_index(link)
    when DOWNLOAD
      file = get(link)
      file.save_as(File.basename(link)) if file # get returns nil on failure
    when PASS
      puts "passing on #{link}"
    end
  end

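  # Marks the URL as visited, then fetches it.
  # Returns nil instead of raising when the request fails.
  def get(site)
    puts "getting #{site}"
    @visited << site
    super(site)
  rescue StandardError => e
    puts "error getting #{site}: #{e.message}"
    nil
  end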
end

Would you believe pasting the code and formatting it took more than 30 minutes? There must be a better way of pasting code. The code is pretty self-explanatory. I’m defining the crawler as a subclass of WWW::Mechanize, and I’m setting its default user agent to that of Internet Explorer 6. After creating the spider, you must supply a callback (a lambda or proc) that takes a link as its argument and returns one of the integer constants INDEX, DOWNLOAD or PASS. This is how the crawler reacts to each of them:

  • if the block returns INDEX, the spider will process that link and all the links found on that page (an indexing process)
  • if the block returns DOWNLOAD, the spider will download that link
  • if the block returns PASS, the spider will ignore the link

Here is a usage sample:

require "crawler"

x = Crawler.new
callback = lambda do |link|
  if link =~ /\.(zip|rar|gz|pdf|doc)/
    # Ignored links go on the visited list so they are not checked again.
    x.remember(link)
    Crawler::PASS
  elsif link =~ /\.(jpg|jpeg)/
    Crawler::DOWNLOAD
  else
    Crawler::INDEX
  end
end


x.callback = callback
x.start("http://somesite.com")

The script above will ignore zip, rar and gz archives, PDF and Word documents, and will download jpg files. The rest of the links will be indexed. As written, the spider only adds a link to the visited list when it performs a get on it, so to speed things up we also add the ignored links to the visited list. This is done with the remember method. If you are curious about the speed improvement from using remember in the callback, test it on a website: try it with it, and then without it.
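One thing to keep in mind about the visited list: it compares links as plain strings, so the same page written as ../about.html in one place and /about.html#team in another counts as two different links. A possible refinement, sketched here with Ruby’s standard uri and set libraries, is to resolve each href against the page it was found on before remembering it. The helper name normalize_link and the example URLs are assumptions made for this illustration; the crawler above does not do this.

```ruby
require "set"
require "uri"

# Resolves href against the URL of the page it was found on and
# drops the #fragment. Returns nil for links that are missing,
# unparsable, or not http(s) (mailto:, javascript:, ...).
def normalize_link(page_url, href)
  return nil if href.nil?
  uri = URI.join(page_url, href)
  return nil unless %w[http https].include?(uri.scheme)
  uri.fragment = nil
  uri.to_s
rescue URI::Error
  nil
end

page = "http://somesite.com/docs/index.html"
hrefs = [
  "../about.html",
  "/about.html#team",
  "http://somesite.com/about.html",
  "mailto:someone@somesite.com"
]

# A Set gives constant-time lookups, unlike Array#include?.
visited = Set.new
hrefs.each do |href|
  link = normalize_link(page, href)
  visited << link if link
end
p visited.to_a # => ["http://somesite.com/about.html"]
```

All three spellings collapse into a single entry, and the mailto link is dropped before it ever reaches the callback.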
Overall, this is a pretty basic spider, which can be extended and improved in a lot of ways. I hope you found this article useful. Catch you around!

Geo


14 Responses to “how to write a spider”


  1. February 13, 2009 at 6:29 pm

    There are a lot of ready made crawlers online that you can use to check for broken links, etc. But of course that takes all the fun out of it since you don’t get the chance to code it yourself.

    I use the wordpress plugin Dean’s Code Highlighter. It supports most languages and makes code formating a lot easier. It does syntax highlighting as a bonus. You should check it out.

  2. 2 geo
    February 13, 2009 at 6:52 pm

    I’m not really sure if as a free user I could add plugins to wordpress . I looked around in the administration panel but I didn’t find anything related to plugins .

  3. February 13, 2009 at 10:26 pm

    Oh right. Pretty sure you would need to buy hosting if you want to add plugins.

  4. February 17, 2009 at 3:49 am

    You can also use Google Webmaster tools to check the broken links

  5. 5 geo
    February 17, 2009 at 9:40 am

    Yes, but checking for broken links is just a small task sometimes. Most of the time you would have to do a lot more work .

  6. March 14, 2009 at 9:39 pm

    Very valuable information! Recommended!

  7. 8 Joe
    March 16, 2009 at 3:10 pm

    Very interesting! Do you know of a follow-up article that would talk about dealing with cookies as you mentioned? Thank you.

  8. 9 geo
    March 16, 2009 at 3:41 pm

    I posted many articles related to web scraping. You can find them under the scraping category in the blog. Working with cookies is covered.

  9. 10 Pieter
    July 26, 2009 at 8:24 pm

    I have just started looking at Ruby as a means of parsing starred RSS links from Google Reader and have been able to get a list of links from all my starred items in Google Reader by doing this:

    require "rubygems"
    require "open-uri"
    require "simple-rss"
    feed = "http://www.google.com/reader/public/atom/user/xxxxx/state/com.google/starred?n=400"
    rss = SimpleRSS.parse open(feed)
    rss.entries.each do |item|
      puts "#{item.title}\n"
    end

    where xxxxx is my 20 digit account number

    At the moment I run it in ruby as ruby mystarredlinks.rb>links.html which gives me a list of links. This is nice but not exactly what I am looking for.

    What I want to do now is follow each of those links and extract from the resulting page the links to SPECIFIC sites. That is, if there is a link to Site1, index that link and move on to the next link from the original Google Reader list; if not, check whether there is a link to Site2, and so on until a valid link is found. The valid links (one per page) are to be stored in an html file.

    Any ideas where to start?

    • July 27, 2009 at 6:53 am

      Sure:

      use an HTML parser to parse your resulting file
      use mechanize to follow each link
      use nokogiri/hpricot to extract the information you need

