15 Feb 09

web scraping techniques



At one time or another, every developer has to extract data from multiple pages. I know most of you use regexes. I do too, sometimes. I even know someone who used PHP’s explode function (hello Andrei!) and then extracted the results. I will show you some of the techniques used to extract data, along with some of the advantages and disadvantages of each one.
In this tutorial I will show you how to scrape the names of the categories in this blog. They are shown in the picture below:

categories1

I am sorry I didn’t capture all the categories in the picture, but I think you get the point. It usually takes me a couple of tries to figure out how to extract the data from a website. It helps (and it is kinder to the website you’re scraping) to write the response to a file (or serialize it) and work on the local copy. This way, you don’t make unnecessary requests. Here’s what I’m talking about:


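require "rubygems"
require "mechanize"

# The page to fetch is passed as the first command-line argument.
site = ARGV[0] || (abort "I need a link")
# Older mechanize releases (before 1.0) used WWW::Mechanize.new instead.
mech = Mechanize.new
mech.get(site)

# Serialize the response body so later experiments need no new requests.
File.open("body_serialized.ser", "wb") do |file|
   Marshal.dump(mech.page.body, file)
end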

The script above requests a page and stores its content in a file named body_serialized.ser. Until we figure out which technique we’re going to use to parse the data, we’ll work with the string stored in that file.
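
To avoid re-fetching by accident, the save and load steps can be wrapped in one helper. This is only a minimal sketch: cached_body is a hypothetical name, not part of mechanize, and the block passed to it stands in for the real request (for example, mech.get(site).body).

```ruby
require "tmpdir"

# Hypothetical helper (not part of mechanize): returns the cached body when
# the file exists; otherwise runs the block, caches its result and returns it.
def cached_body(path)
  if File.exist?(path)
    File.open(path, "rb") { |file| Marshal.load(file) }
  else
    body = yield
    File.open(path, "wb") { |file| Marshal.dump(body, file) }
    body
  end
end

# Demo in a temporary directory; the string is a stand-in for a real response.
Dir.mktmpdir do |dir|
  path = File.join(dir, "body_serialized.ser")
  fetches = 0
  2.times do
    cached_body(path) do
      fetches += 1
      "<html>stand-in body</html>"
    end
  end
  puts fetches # the block runs only on the first call
end
```

In a real script the block would do the request, so the site is hit only when no local copy exists yet.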

  • Technique no.1: Text searching (no regexes)

    This is one of the most widely used techniques out there. Most of the time, it’s used by people who don’t understand regular expressions. It works by trial and error. Here is the first try:

    
    # first try
    text = ""
    
    File.open("body_serialized.ser") do |file|
    	text = Marshal.load(file)
    end
    
    text.each_line do |line|
    	line = line.downcase
    	if(line.index("title=\"view all posts"))
    		contents = line.split(" ")
    		interesting = contents[9]
    		puts interesting
    	end	   
    end
    

    Running the script produces the following output:

    
    
    <a
    nil
    http">http</a>
    java">java</a>
    music">music</a>
    ruby">ruby</a>
    spider">spider</a>
    swing">swing</a>
    synchronization">synchronization</a>
    threads">threads</a>
    videos">videos</a>
    

    It’s a start! We’ll have to work a bit on this one… Here’s the version of the script that extracts what we want:

    
    # second try: keep only 10-token lines and cut the last token at the quote
    text = ""
    
    File.open("body_serialized.ser") do |file|
    	text = Marshal.load(file)
    end
    
    text.each_line do |line|
    	line = line.downcase
    	if(line.index("title=\"view all posts"))
    		contents = line.split(" ")
    		if(contents.size == 10)
    			interesting = contents[9]
    			quote_position = interesting.index("\"")
    			if(!quote_position.nil?)
    				interesting = interesting[0,quote_position]
    				puts interesting
    			end
    		end
    	end	   
    end
    
    

    Running this script produces the following output:

    
    http
    java
    music
    ruby
    spider
    swing
    synchronization
    threads
    videos
    

    And that was exactly what we needed! Unlike the first version of this script, this one checks whether splitting a line into tokens returns exactly 10 tokens. I did this to filter out the categories that appear in posts. For example, if I didn’t have this condition, the script would also try to parse the categories from my post, and it would fail, because the HTML code for them is a little bit different.
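
    To see why the tenth token is the interesting one, here is a small sketch run on a single line. The sample line is hypothetical (I made it up to fit the output above; the blog’s real markup may differ), and category_from_line is just an illustrative name for the same steps as the script.

```ruby
# Hypothetical sample line; the real markup of the blog may differ.
SAMPLE_LINE = %(\t<li class="cat-item cat-item-7"><a href="http://example.com/category/http/" title="View all posts filed under http">http</a>\n)

# Same steps as the script above, applied to one line.
# Returns the category name, or nil when the line does not fit the pattern.
def category_from_line(line)
  line = line.downcase
  return nil unless line.index('title="view all posts')

  tokens = line.split(" ")
  return nil unless tokens.size == 10

  last = tokens[9]                 # e.g. http">http</a>
  quote_position = last.index('"') # end of the title attribute
  quote_position && last[0, quote_position]
end

puts category_from_line(SAMPLE_LINE)
```

    A line with one more or one fewer space-separated token returns nil, which is exactly how the categories inside posts get filtered out.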

    Advantages of this technique:

    • You don’t need to understand regexes
    • You don’t need any third-party library/framework/tool
    • Pretty easy to do if you know a bit of programming

    Disadvantages of this technique:

    • Searching text this way takes a lot of tries to get the right results
    • If the site’s HTML changes its structure, the script is unusable (you would have to do this all over again)

  • Technique no.2: Text searching (with regexes)

    If you understand regexes, this job will be a lot simpler and faster. This site is one of the best sites related to regular expressions. If you don’t know them, I would recommend you pay it a visit. Here’s the script that extracts the categories using regular expressions:

    
    text = ""
    
    File.open("body_serialized.ser") do |file|
    	text = Marshal.load(file)
    end
    
    matches = text.scan(/title=\"View all posts.*?>(.*?)<\/a>/i)
    matches.each do |match|
    	puts match
    end
    

    Running this script produces exactly the results we’re after. If you don’t believe me, try it!
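
    If you want to try the regex without fetching anything, the sketch below runs the same pattern on a hypothetical two-link snippet (made up for illustration; the real page has more markup around it).

```ruby
# Hypothetical stand-in for two sidebar links; the real markup may differ.
SAMPLE_HTML = <<~HTML
  <li><a href="/category/http/" title="View all posts filed under http">http</a></li>
  <li><a href="/category/java/" title="View all posts filed under java">java</a></li>
HTML

# Same pattern as in the script above.
CATEGORY_LINK = /title="View all posts.*?>(.*?)<\/a>/i

# scan returns one array of captures per match; flatten gives plain strings.
categories = SAMPLE_HTML.scan(CATEGORY_LINK).flatten
puts categories
```

    The lazy .*? parts matter here: they stop at the first > and the first closing a tag, so each link is matched on its own.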

    Advantages of this technique

    • You get to write less code
    • Development speed greatly increases! This script took me about 2 minutes to write (and it worked the first time), while the script for the first technique took about 10 minutes (and I wrote it using trial and error).
    • The regex is pretty easy to replace (you don’t have to modify the whole script to make it work if the site changes its structure)

    Disadvantages of this technique

    • You have to know regexes to use this technique (duh :) )
    • A lot of developers don’t know how to use them or find them difficult to use

  • Technique no.3: XPath

    To get the most out of this technique, I assume you’re using Firefox. You will have to install the XPath Checker addon. After you have installed the addon, navigate to this blog, right-click on one of the categories, and select “View XPath”, like in the following picture:

    snapshot1

    A window like this will appear:
    xpath_window

    As you can see in the window, we have the XPath expression for music. Notice that if you manipulate the XPath expression, you will see the “matches” in the window. In this example, we want to find the other categories as well. The XPath expression the window shows is this: id('categories-352220371')/ul/li[3]/a, which loosely translates to:

    • get me the link that is a child of the third li, which is a child of an unordered list, which is a child of the element with the id categories-352220371

    We can see that the li part looks like an array of some sort. If we check against the page, we can see that the music category is indeed the third category. So, what do you think will happen if we replace li[3] with li? You guessed right: we get a list of all the categories. I think you will agree with me when I say that this technique is very simple and effective. Here is the code that extracts the categories using XPath:

    
    require "rubygems"
    require "nokogiri"
    
    text = ""
    
    File.open("body_serialized.ser") do |file|
    	text = Marshal.load(file)
    end
    
    doc = Nokogiri::HTML(text)
    doc.xpath("id('categories-352220371')/ul/li/a").each do |category|
    	puts category.text.chomp
    end
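
    The difference between li[3] and li can also be checked offline. The sketch below is not the script above: it swaps nokogiri for REXML (which ships with Ruby) and uses a hypothetical, well-formed snippet in place of the real page. It also spells the id lookup as an attribute test instead of the id() function, which is my own substitution.

```ruby
require "rexml/document"

# Hypothetical, well-formed stand-in for the blog's sidebar markup.
SIDEBAR_XML = <<~MARKUP
  <div id="categories-352220371">
    <ul>
      <li><a href="/category/http/">http</a></li>
      <li><a href="/category/java/">java</a></li>
      <li><a href="/category/music/">music</a></li>
    </ul>
  </div>
MARKUP

doc = REXML::Document.new(SIDEBAR_XML)
base = "//*[@id='categories-352220371']/ul"

# li[3] selects only the third list item ...
third = REXML::XPath.match(doc, "#{base}/li[3]/a").map(&:text)
# ... while a bare li selects every list item.
all = REXML::XPath.match(doc, "#{base}/li/a").map(&:text)

puts third.inspect
puts all.inspect
```

    REXML only reads well-formed XML, so for real-world HTML the nokogiri version above is the safer choice.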
    

    Advantages of this technique

    • Development speed is greatly enhanced! It took me 2 minutes to write the script once I found the XPath expression.
    • The XPath expression is very easy to replace! (a matter of seconds)
    • Both Firefox and the XPath Checker addon are free to use

    Disadvantages of this technique


  • Technique no.4: scraping using CSS (kind of)

    You could scrape a web page “with style”. This doesn’t mean you should wear an Armani suit while writing your code :) . It means you could use CSS selectors to find the information you need. Since I’m not a web developer, and I know extremely little CSS, I won’t show you how to use this technique (because I don’t know how to use it either). It’s sufficient to know that it exists. You can find more information about it here.

    Advantages of this technique

    • If you know how to use CSS, I think your development speed would increase
    • I think scripts using this technique won’t need to be modified as often as the others. From what I know, a page’s style doesn’t change that often.

    Disadvantages of this technique

    • You must know CSS

Perhaps other web scraping techniques exist, and perhaps they are better than the ones I showed here. I don’t know. These are the ones I use (I don’t really use the first one now), and they work for me. I hope you enjoyed this tutorial!

See you around !

Geo

13 Responses to “web scraping techniques”


  1. February 17, 2009 at 11:55

    You can also use Watir and Firewatir to scrape the page… but it loads up the browser navigates to the right page and then scrapes it… it is good for pages that require cookies, but rather slow in general (especially since it is *supposed* to be a web testing framework)

  2. 2 geo
    February 17, 2009 at 12:34

    I haven’t really used Watir/Firewatir , but I used Win32::IE::Mechanize for perl with success ( about 2 years ago ) . As for the cookies part , mechanize can handle them . The reason I would use watir/firewatir/win32::ie::mechanize would be to go around a captcha based login :)

  3. February 17, 2009 at 12:37

    Great post! I had to develop a web-scrapping software some time ago, to extract classified ads from multiple sites. What I did was to include all the generic scrapping functions on the software and have an XML file with the particular structure for every site I wanted to scrap.

  4. 4 Joe
    March 17, 2009 at 00:30

    Thanks, Geo. Nice article. How do you input data to a web page automatically – do you have an article on that?

  5. 5 geo
    March 17, 2009 at 09:14

    Are you referring to filling out forms?

  6. 6 az
    April 7, 2009 at 04:26

    I need to scrape a few websites and am not a coder. Only 2 sites require logins. All of the data is public domain data. Where can I find someone who will do it without breaking my limited budget?

  7. 8 flyby
    April 11, 2009 at 17:27

    Hey, use IRobotSoft dummies’ tool to do it.

  8. August 25, 2010 at 13:58

    Nice useful article keep post like this ..

