Archive for the 'http' Category

24 Feb 09

how to scrape data from sites you can’t log into

As a disclaimer, you are responsible for how you use the information I’m showing you here; it is presented only as a learning experience. There are many sites for which writing code that logs into them is very difficult, which is why we will reuse the browser’s cookies. This way, you can focus only on the scraping task.

IMPORTANT: I am using Firefox for this tutorial.

You should know that websites identify you through cookies. This is Wikipedia’s definition:

HTTP cookies, more commonly referred to as Web cookies, tracking cookies or just cookies, are parcels of text sent by a server to a Web client (usually a browser) and then sent back unchanged by the client each time it accesses that server. HTTP cookies are used for authenticating, session tracking (state maintenance), and maintaining specific information about users, such as site preferences or the contents of their electronic shopping carts. The term “cookie” is derived from “magic cookie,” a well-known concept in UNIX computing which inspired both the idea and the name of HTTP cookies. Tracking cookies track your web browsing habits. They can collect information about pages and advertisements you have seen or any other activity during browsing. Different websites can share tracking cookies, and each website with the same tracking cookie can read the information and write new information into it.

You can read more about them here.

Here is what you have to do:

  • open the site that you want to scrape
  • log in
  • install the Firefox addon Cookie Monster
  • restart Firefox
  • in Firefox’s bottom right corner you should now have an icon with the letters CM (big blue C). Left click it. You should see something like this:
    [screenshot: cookie_monster_view]
  • click View cookies, and select the first option “Show cookies for [whatever site you’re visiting now]”
  • you should now see all the cookies for the site. Here’s what it looks like when I visit Google:
    [screenshot: cookie_monster_view2]
  • in order to appear as logged in, the Cookie header should contain the cookies that Cookie Monster is showing. So, all you have to do is build a string to which you append them in the following format: cookie_name=cookie_value; other_cookie_name=other_cookie_value. As you can see, the pairs are separated by a semicolon (the standard form also puts a space after it). The order should not matter; I’m adding them in the order Cookie Monster shows them.
  • by adding that Cookie header to each request you make, you will appear as logged in, and you can perform the scraping you want. Here is some Java code that demonstrates this:
    
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    public class CookieChecker {
        public static void main(String[] args) throws Exception {
            // replace with the page you want to scrape
            URLConnection con = new URL("http://some_site").openConnection();
            con.setAllowUserInteraction(false);
            // send the cookies copied from the browser, separated by "; "
            con.setRequestProperty("Cookie", "cookie_name1=cookie_val1; cookie_name2=cookie_val2");
            // read the response line by line and close the stream when done
            try (BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
                String line;
                while ((line = br.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
    
  • you should know that this isn’t the only way to accomplish this task. You could also “automate” Firefox from your code. Check out this post I wrote to see how to do it using WebDriver.
  • the next step is to choose a scraping technique, and build your scraper around it. Check out web scraping techniques for more information.
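
The cookie string described in the steps above can also be assembled from code instead of by hand. The following is a minimal sketch in Ruby (the language used for the crawler elsewhere in this archive), not a definitive implementation. The helper names cookie_header and request_with_cookies, the cookie names and the example.com URL are all assumptions made for illustration. The request is only built, not sent.

```ruby
require "net/http"
require "uri"

# Join name/value pairs into the value of a Cookie request header.
# (hypothetical helper, not part of any library)
def cookie_header(cookies)
  cookies.map { |name, value| "#{name}=#{value}" }.join("; ")
end

# Build (but do not send) a GET request that carries the cookies.
# (hypothetical helper, not part of any library)
def request_with_cookies(url, cookies)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request["Cookie"] = cookie_header(cookies)
  request
end

# placeholder values; use the pairs Cookie Monster shows for your site
cookies = [["cookie_name1", "cookie_val1"], ["cookie_name2", "cookie_val2"]]
request = request_with_cookies("http://example.com/members", cookies)
puts request["Cookie"]
```

Keeping the pairs in an array keeps them in the order they were copied from the browser.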

Hopefully, this tutorial will help you in your scraping tasks.

See you later!

UPDATE: it seems further explanations are needed :). On a usual web scraping task I follow these steps (and I’m sure most of you do the same thing):

  1. you navigate to the login page
  2. you inspect the fields with Firebug, in order to find out the name of the form they belong to, as well as their names/ids
  3. you log in and look for signs that the login was successful (for example, something on the page that says “welcome back, user”)
  4. you start writing a mechanize/selenium/(insert your favorite framework here) script
  5. you write a regex that will check for the welcome message
  6. if everything was successful, you go ahead with the other part

By using the cookie directly, you go from step 1 to step 6 directly. You pass GO, you collect $200 :). Personally, I use this when I need something done fast (you could consider this prototyping). Best practice would be to log in normally to a website, but I think it’s good to know you can do things even faster. It’s true that you depend on the browser, and that each time you want to use the code you have to make sure the cookie is still valid but, as I said before, this is just for quick & dirty development.
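
The welcome-message check from step 5 is also a cheap way to notice that a copied cookie has expired. Here is a minimal sketch, assuming the site greets logged-in users with “welcome back, <name>”. The exact wording and the names WELCOME_PATTERN and logged_in? are assumptions, so adapt them to the site you are scraping.

```ruby
# Pattern for the greeting shown after a successful login.
# The wording is an assumption; adapt it to the site being scraped.
WELCOME_PATTERN = /welcome back,\s*\w+/i

# Return true when the fetched HTML contains the greeting.
# (hypothetical helper, not part of any library)
def logged_in?(html)
  !html.nil? && html.match?(WELCOME_PATTERN)
end

puts logged_in?("<p>Welcome back, geo</p>")
puts logged_in?("<form>Please log in</form>")
```

If the check fails on a page that used to pass, the cookie has most likely expired and needs to be copied from the browser again.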
Some of you asked “Why Java? Use mechanize; mechanize is uber1337.” I know mechanize is cool, I’m a big fan myself. I started using mechanize way back, when there was only the Perl version. I like the way Ruby’s mechanize evolved, and I personally consider it to be the best testing framework around. So, on topic, I used Java because I had NetBeans open at the time I wrote this article. Didn’t expect this answer, did you?
I wrote in a comment that when you work with a team of developers, not all the developers will have the same skills. I find it easier to show a developer a snippet of code like the one I pasted above than to show them how to go through steps 1 to 6. The thing is, most of the time there is a deadline, the boss keeps pressing you, and you can’t really spare 1-2 hours to guide someone through this process.
kthxbai

19 Feb 09

don’t be afraid to try other frameworks

I really love open source. If open source did not exist, I think my productivity would drop a lot. The great thing is that you always have alternatives. If you don’t want to code in a specific language, there’s a good chance you’ll find something similar for another language. For example, mechanize is available for Perl, Ruby and Python.

Because each library is built upon different other libraries, things won’t always work out the way you want them to.

For example, today I was trying to scrape a page using XPath. I chose mechanize and nokogiri, and I obtained the XPath expression with XPath checker. However, I wasn’t getting back the stuff I needed. In fact, I wasn’t getting back any results.

The first thing I tried was to switch from nokogiri to REXML. The problem didn’t go away. REXML raised an exception, from which I deduced that the page wasn’t well-formed. Usually, in situations like this, you would want to use tidy to clean it up. Surprisingly, this didn’t fix it either; REXML still raised the same exception.
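
To make the well-formedness problem concrete, here is a minimal sketch of how REXML behaves on markup with unclosed tags. The sample markup and the helper name xpath_texts are made up for illustration; this is not the page or the code from the story above.

```ruby
require "rexml/document"

# Return the text of every node matching the XPath expression,
# or nil when the markup is not well-formed XML.
# (hypothetical helper, not part of REXML)
def xpath_texts(markup, xpath)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, xpath).map { |node| node.text }
rescue REXML::ParseException
  nil
end

well_formed = "<ul><li>one</li><li>two</li></ul>"
# the <li> tags are never closed, which browsers accept but XML parsers reject
malformed = "<ul><li>one<li>two</ul>"

p xpath_texts(well_formed, "//li")
p xpath_texts(malformed, "//li")
```

A browser repairs markup like this silently. That may be why the same XPath expression can match in a browser-based tool and fail in a strict parser.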

I usually use mechanize for all my scraping needs, but using something different isn’t out of my comfort zone. So, I thought I would try something Java based. Because I didn’t want to write a couple of hundred lines of code for some simple scraping, I decided to go along with WebDriver. This is WebDriver’s description:

WebDriver has a simple API designed to be easy to work with and can drive both real browsers, for testing javascript heavy applications, and a pure ‘in memory’ solution for faster testing of simpler applications.

With WebDriver, I was able to control Firefox, and my XPath worked! Just like that! From this point on, everything worked just fine.

Even though I could have dropped XPath and still used mechanize/nokogiri for scraping, switching to WebDriver kept the task very simple. Here’s some WebDriver sample code:

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class WebDriverTest {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // use the default firefox profile (constructor from the WebDriver
        // release used at the time of writing)
        WebDriver driver = new FirefoxDriver("default");

        // insert your XPath expression here
        String xpath = "xpath_expression";

        // load the page
        driver.get("http://some_site");

        // find elements of the page by XPath
        List<WebElement> elements = driver.findElements(By.xpath(xpath));

        // findElements returns an empty list when nothing matches
        if (!elements.isEmpty()) {
            // iterate over each element
            for (WebElement element : elements) {
                // print text
                System.out.println(element.getText());
            }
        } else {
            System.out.println("nothing found");
        }
    }
}

As you can see, the code is very readable. The only thing I needed to add was to place the whole XPath extraction in a loop and click a button named “Next” to get to the next page.

This doesn’t apply only to web scraping, but to other programming-related activities too. You should always explore other frameworks, because some of them will increase your productivity more than others. Knowing the “competition” can and most likely will help you; even if not now, in the long run it will be worth the time you invested in them.

P.S.: is it just me, or is this blog’s title becoming ssssssscraping? :)

13 Feb 09

how to write a spider


In this article I will teach you how to write a spider. According to Wikipedia, this is the definition of a spider:

A Web crawler (also known as a Web spider, Web robot, or—especially in the FOAF community—Web scutter) is a program or automated script that browses the World Wide Web in a methodical, automated manner

In layman’s terms, this means: a spider is an application that does something with every link on every page it finds. Let’s say, for example, you want to check your website to see if all your links are still good (no 404s). To accomplish this, we could use a spider, and set a callback so that each time the spider finds a broken link, something (defined by us) happens. Maybe writing a report, or storing the links in a database. Who knows? The decision is yours.

Steps to follow in writing your own crawler

  1. Requesting a page
    The first step is deciding how you are going to interact with a page. In order to load a page, an HTTP connection has to be made, and the page has to be requested. Here’s how a request would be done if you happen not to have a browser nearby:

    telnet somesite.com 80
    GET /index.html HTTP/1.0
    (press ENTER/RETURN twice)

    In the example above, I’m using telnet to connect to the web server. Most sites are served on port 80, but don’t be surprised if you find some that use other ports (like 8080). The second line of the example is the request line. It names the method we’re going to use to request the page (GET in this case), the page we want (/index.html, that is http://somesite.com/index.html), and the HTTP version we want to use (HTTP/1.0). After that line, press RETURN/ENTER twice, and the server will send you the page (along with the HTTP headers). It’s not really complicated to do, even from your code, but it becomes pretty cumbersome once you need to work with cookies or use other HTTP methods. That’s why I suggest using a higher-level library that does all the heavy lifting for you. For this article, I will use WWW::Mechanize for Ruby.

  2. Parsing an HTML page
    If we are to find the links in a page, we must parse it. Some guys accomplish this task with regexes; smarter guys use HTML parsers. You are free to use whichever parser you like. I am using nokogiri, the parser mechanize relies on.
  3. The Spidering per se
    Once you know how to load a page and parse the links in it, the rest is just a matter of recursion. Like any recursive solution, it should have a stop condition, or else bad things happen :). Pseudo-code for a spider could look like this:

    request page
    for each link in page's links
      if(link matches our conditions and link wasn't visited before)
        do something with the link
        add the link to the already visited links
        go back to first step with link as page
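
Going back to step 1, the telnet session can also be reproduced from code with a plain TCP socket. This is a minimal sketch under stated assumptions, not a replacement for a real HTTP library. The helper names build_request, fetch_raw and split_response are made up for illustration, and the Host header is an addition that the telnet example does not send (servers hosting several sites on one address usually need it).

```ruby
require "socket"

# Build the text that was typed into telnet: request line, a Host header
# (an addition, see above), and the blank line that ends the request.
def build_request(host, path)
  "GET #{path} HTTP/1.0\r\nHost: #{host}\r\n\r\n"
end

# Send the request over a plain TCP socket and return everything the
# server sends back (headers and page) as one string.
def fetch_raw(host, path = "/index.html", port = 80)
  socket = TCPSocket.new(host, port)
  socket.write(build_request(host, path))
  socket.read
ensure
  socket.close if socket
end

# Split a raw response into its header block and its body.
def split_response(raw)
  headers, body = raw.split("\r\n\r\n", 2)
  [headers, body.to_s]
end

# usage (needs network access, so it is left commented out):
# headers, body = split_response(fetch_raw("somesite.com"))
# puts headers
```

Cookies, redirects and other HTTP methods would all have to be added by hand here, which is the cumbersome part a higher-level library takes care of.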

Those would be the steps of writing a very simple crawler. I’ll show you a crawler I wrote in Ruby:

require "rubygems"
require "mechanize"

class Crawler < WWW::Mechanize

  attr_accessor :callback
  INDEX = 0
  DOWNLOAD = 1
  PASS = 2

  def initialize
    super
    init
    @first = true
    self.user_agent_alias = "Windows IE 6"
  end

  def init
    @visited = []
  end

  def remember(link)
    @visited << link
  end

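  # fetch a page and crawl every link on it that was not visited yet
  def perform_index(link)
    page = get(link)
    # get returns nil on errors; non-HTML responses have no links to follow
    return unless page.is_a?(WWW::Mechanize::Page)
    hrefs = page.links.map { |l| l.href } - @visited
    hrefs.each do |href|
      start(href)
    end
  end

  # ask the callback what to do with a link, then act on its answer
  def start(link)
    return if link.nil?
    return if @visited.include?(link)
    action = @callback.call(link)
    # the start page is always indexed, whatever the callback returned
    if @first
      @first = false
      action = INDEX
    end
    case action
    when INDEX
      perform_index(link)
    when DOWNLOAD
      file = get(link)
      file.save_as(File.basename(link)) unless file.nil?
    when PASS
      puts "passing on #{link}"
    end
  end

  # wrap mechanize's get: log the request, mark the link as visited,
  # and return nil instead of raising when the request fails
  def get(site)
    puts "getting #{site}"
    @visited << site
    super(site)
  rescue StandardError
    puts "error getting #{site}"
    nil
  end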
end

Would you believe pasting the code and formatting it took more than 30 minutes? There must be a better way of pasting code. The code is pretty self-explanatory. I’m defining the crawler as a subclass of WWW::Mechanize, and I’m setting its default user-agent to that of Internet Explorer 6. After creating the spider, the coder must supply a callback (in the form of a lambda/block). The block takes a link as an argument, and must return one of the integer constants INDEX, DOWNLOAD or PASS. This is how the crawler reacts to them:

  • if the block returns INDEX, the spider will process that link and all the links found on that page (an indexing process)
  • if the block returns DOWNLOAD, the spider will download that link
  • if the block returns PASS, the spider will ignore the link

Here is a usage sample:

require "crawler"

x = Crawler.new
callback = lambda do |link|
  if(link =~ /\.(zip|rar|gz|pdf|doc)/)
    x.remember(link)
    return Crawler::PASS
  elsif(link =~ /\.(jpg|jpeg)/)
    return Crawler::DOWNLOAD
  end
  return Crawler::INDEX
end


x.callback = callback
x.start("http://somesite.com")

The script above will ignore zip, rar and gz archives, pdf and Word documents, and will download jpg files. The rest of the links will be indexed. The way the spider is written now, it only adds to the visited list the links on which it performed a get, so, in order to speed things up, we also add the ignored links to the visited list. This is done with the remember method. If you are curious about the speed improvement that remember brings, test the callback on a website with it, and then without it.
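
One possible refinement of the visited list, sketched here as an assumption rather than as part of the crawler above: resolve every href against the page it was found on and drop the #fragment before checking whether it was visited. That way the same page is not fetched twice under different spellings. The helper name normalize_link is made up for illustration, and a Set stands in for the array.

```ruby
require "set"
require "uri"

# Resolve href against the page it was found on and drop the #fragment.
# Returns nil when the href cannot be parsed as a URI.
# (hypothetical helper, not part of mechanize)
def normalize_link(page_url, href)
  uri = URI.join(page_url, href)
  uri.fragment = nil
  uri.to_s
rescue URI::Error
  nil
end

visited = Set.new
["index.html", "index.html#news", "/about.html"].each do |href|
  url = normalize_link("http://somesite.com/index.html", href)
  # Set#add? returns nil when the url is already in the set
  puts "new: #{url}" if url && visited.add?(url)
end
```

A Set also keeps the "was this visited?" check fast as the list grows, whereas an array is searched element by element.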
Overall, this is a pretty basic spider, which can be extended/improved in a lot of ways. I hope you found this article useful. Catch you around!

Geo