Archive for the 'web' Category

28 Feb 10

sinatra


I really love Sinatra, the web framework. It’s so lightweight and enjoyable. You should give it a try.
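
To give you an idea of just how lightweight it is, here's the canonical hello-world (the route and message are mine, purely illustrative):

require "rubygems"
require "sinatra"

# The whole app: one route, one handler.
# Run the file and visit http://localhost:4567/
get "/" do
	"Hello from Sinatra!"
end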

12 Mar 09

i’m learning web development

In the past few weeks, I've had many job offers, all of them related to web design or web application development. Because my web skills are pretty much non-existent, I had to turn them down.

So, in an effort to become a better developer, I'm learning web.py as a web framework and MooTools as a JavaScript framework.

Because I don't want to just read books and tutorials about building "web skills", I decided to learn them by writing an application: a microblogging app. Things are going somewhat well … I have a textarea that "resembles" the ones used in this kind of application (somewhat functional). I still have a long way to go, and I'm pretty excited.

05 Mar 09

webkit for swt


While checking new links on DZone, I found this. Just imagine how many cool tools you can build with this 🙂

24 Feb 09

how to detect spiders/web crawlers

In the previous posts, I’ve written about the techniques one could use to perform web scraping. I feel it’s important that developers know how to detect spiders and how to restrict them.

I think the Stack Overflow question "How do you stop scripters from slamming your website hundreds of times a second?" compiles the best information on this topic. You can read the whole thing here.
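
The simplest detection approach is matching the User-Agent header against known crawler signatures. Here's a rough sketch (the pattern list is illustrative, and keep in mind the header is trivially spoofed, which is why this can only ever be a first line of defense):

# Naive spider detection: well-behaved bots identify themselves in the
# User-Agent header; hostile scrapers will simply fake a browser string.
CRAWLER_PATTERN = /googlebot|msnbot|slurp|crawl|spider/i

def crawler?(user_agent)
	!user_agent.nil? && !(user_agent =~ CRAWLER_PATTERN).nil?
end

puts crawler?("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)") # true
puts crawler?("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)")                       # false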

15 Feb 09

web scraping techniques


At one time or another, every developer has to extract data from multiple pages. I know most of you use regexes. I do too, sometimes. I even know someone who used PHP's explode function (hello Andrei!) and then extracted the results from the pieces. I will show you some of the techniques used to extract data, along with some of the advantages and disadvantages of each one.
In this tutorial I will show you how to scrape the names of the categories on this blog. They are shown in the picture below:

[image: categories1 (screenshot of this blog's category list)]

I'm sorry I didn't capture all the categories in the picture, but I think you get the point. It usually takes me a couple of tries to figure out how to extract the data from a website. It helps (and it's kinder to the website you're scraping) if you write the response to a file (or serialize it) and work on the local version. This way, you don't make unnecessary requests. Here's what I'm talking about:


require "rubygems"
require "mechanize"

site = ARGV[0] || (abort "I need a link")
mech = WWW::Mechanize.new
mech.get(site)

File.open("body_serialized.ser","w") do |file|
   Marshal.dump(mech.page.body,file)
end

The script above requests a page and stores its content in a file named body_serialized.ser. Until we figure out which technique we'll use to parse the data, we'll work with the string stored in the body_serialized.ser file.

  • Technique no.1: Text searching (no regexes)

    This is one of the most used techniques out there. Most of the time, it's used by people who don't understand regular expressions. It's a technique that works by trial and error. Here is the first try:

    
    # first try: look for the marker text on each line
    text = ""

    # Load the locally serialized page body (written by the script above).
    File.open("body_serialized.ser", "rb") do |file|
    	text = Marshal.load(file)
    end

    text.each_line do |line|
    	line = line.downcase
    	# Category links have a title starting with 'View all posts'
    	if line.index("title=\"view all posts")
    		contents = line.split(" ")
    		interesting = contents[9]
    		puts interesting
    	end
    end
    

    Running the script produces the following output:

    
    
    <a
    nil
    http">http</a>
    java">java</a>
    music">music</a>
    ruby">ruby</a>
    spider">spider</a>
    swing">swing</a>
    synchronization">synchronization</a>
    threads">threads</a>
    videos">videos</a>
    

    It's a start! We'll have to work a bit on this one … here's the version of the script that extracts exactly what we want:

    
    # second try: keep only the lines that split into exactly 10 tokens
    text = ""

    File.open("body_serialized.ser", "rb") do |file|
    	text = Marshal.load(file)
    end

    text.each_line do |line|
    	line = line.downcase
    	if line.index("title=\"view all posts")
    		contents = line.split(" ")
    		# Sidebar category links split into exactly 10 tokens; this filters
    		# out the differently formatted category links inside posts.
    		if contents.size == 10
    			interesting = contents[9]
    			# The token looks like 'music">music</a>'; keep what's before the quote.
    			quote_position = interesting.index("\"")
    			unless quote_position.nil?
    				interesting = interesting[0, quote_position]
    				puts interesting
    			end
    		end
    	end
    end
    
    

    Running this script produces the following output:

    
    http
    java
    music
    ruby
    spider
    swing
    synchronization
    threads
    videos
    

    And that is exactly what we needed! Unlike the first version of this script, this one checks whether splitting a line into tokens yields exactly 10 tokens. The reason I did this is to filter out the categories that appear inside posts. For example, if I didn't have this condition, the script would also try to parse the categories from my posts, and it would fail, because the HTML code for them is a little bit different.

    Advantages of this technique:

    • You don't need to understand regexes
    • You don't need any third-party library/framework/tool
    • Pretty easy to do if you know a bit of programming

    Disadvantages of this technique:

    • Searching text this way takes a lot of tries to get the right results
    • If the site's HTML structure changes, the script becomes unusable (you would have to do this all over again)
  • Technique no.2: Text searching (with regexes)
  • If you understand regexes, this job becomes a lot simpler and faster. This site is one of the best resources on regular expressions; if you don't know them, I recommend paying it a visit. Here's the script that extracts the categories using regular expressions:

    
    text = ""
    
    File.open("body_serialized.ser") do |file|
    	text = Marshal.load(file)
    end
    
    matches = text.scan(/title=\"View all posts.*?>(.*?)<\/a>/i)
    matches.each do |match|
    	puts match
    end
    

    Running this script produces exactly the results we're after. If you don't believe me, try it!

    Advantages of this technique

    • You get to write less code
    • Development speed greatly increases! This script took me about 2 minutes to write (and it worked on the first try), while the script for the first technique took about 10 minutes (and was written by trial and error).
    • The regex is pretty easy to replace (you don't have to modify the whole script if the site changes its structure)

    Disadvantages of this technique

    • You have to know regexes to use this technique (duh 🙂)
    • A lot of developers don't know how to use them, or find them difficult to use
  • Technique no.3: XPath
  • To get the most out of this technique, I assume you're using Firefox. You'll need to install the XPath Checker add-on. After you've installed it, navigate to this blog, right-click on one of the categories, and select "View XPath", like in the following picture:

    [image: snapshot1 (the right-click menu showing the "View XPath" option)]

    A window like this will appear:
    [image: xpath_window (the XPath Checker window showing the expression and its matches)]

    As you can see in the window, we have the XPath expression for music. Notice that if you edit the XPath expression, the "matches" shown in the window change accordingly. In this example, we want to find the other categories as well. The XPath expression the window shows is this: id('categories-352220371')/ul/li[3]/a, which loosely translates to:

    • get me the link that is a child of the third li, which is a child of an unordered list, which is a child of the element with the id categories-352220371

    We can see that the li part looks like an array index of some sort. If we check the page, we can see that the music category is indeed the third category. So, what do you think will happen if we replace li[3] with li? You guessed right: we get a list of all the categories. I think you will agree with me when I say that this technique is very simple and effective. Here is the code that extracts the categories using XPath (a small self-contained demo of the li[3] vs. li difference follows it):

    
    require "rubygems"
    require "nokogiri"
    
    text = ""
    
    File.open("body_serialized.ser") do |file|
    	text = Marshal.load(file)
    end
    
    doc = Nokogiri::HTML(text)
    doc.xpath("id('categories-352220371')/ul/li/a").each do |category|
    	puts category.text.chomp
    end
    

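    If you want to see the li[3] vs. li difference for yourself, here's a small self-contained demo; the markup is made up, modeled on the category list, and only the id comes from the XPath expression above:

    require "rubygems"
    require "nokogiri"

    # Made-up markup shaped like the category list (only the id is real).
    html = <<-HTML
    <div id="categories-352220371">
      <ul>
        <li><a href="#">java</a></li>
        <li><a href="#">music</a></li>
        <li><a href="#">ruby</a></li>
      </ul>
    </div>
    HTML

    doc = Nokogiri::HTML(html)
    # li[3] selects only the third list item ...
    puts doc.xpath("id('categories-352220371')/ul/li[3]/a").first.text  # => ruby
    # ... while plain li matches all of them.
    doc.xpath("id('categories-352220371')/ul/li/a").each { |a| puts a.text }
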
    Advantages of this technique

    • Development speed is greatly enhanced! It took me 2 minutes to write the script once I found the XPath expression.
    • The XPath expression is very easy to replace! (a matter of seconds)
    • Both Firefox and the XPath Checker add-on are free to use

    Disadvantages of this technique

    • None that I can think of, honestly

  • Technique no.4: scraping using CSS (kind of)
  • You could scrape a web page "with style". This doesn't mean you should wear an Armani suit while writing your code 🙂. It means you can use CSS selectors to find the information you need. Since I'm not a web developer and I know extremely little CSS, I can't do this technique justice; it's sufficient to know that it exists (a bare-bones sketch follows below). You can find more information about it here.
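
    For completeness, here's what that bare-bones sketch looks like: Nokogiri accepts CSS selectors alongside XPath, and the selector below simply mirrors the XPath expression from technique no.3, so treat it as an illustration rather than proper CSS practice:

    require "rubygems"
    require "nokogiri"

    text = ""

    File.open("body_serialized.ser", "rb") do |file|
    	text = Marshal.load(file)
    end

    doc = Nokogiri::HTML(text)
    # '#categories-352220371 ul li a' reads: anchors inside list items inside
    # the element with that id. The CSS twin of the XPath expression above.
    doc.css("#categories-352220371 ul li a").each do |category|
    	puts category.text.chomp
    end
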
    Advantages of this technique

    • If you know how to use CSS, I think your development speed would increase
    • I think scripts using this technique won't need to be modified as often as the others. From what I know, a page's style doesn't change that often.

    Disadvantages of this technique

    • You must know CSS

Perhaps other web scraping techniques exist; perhaps they are better than the ones I showed here. I don't know. These are the ones I use (well, I don't really use the first one anymore), and they work for me. I hope you enjoyed this tutorial!

See you around!

Geo



