I really love Sinatra, the web framework. It’s so lightweight and enjoyable. You should give it a try.
Archive for the 'web' Category
sinatra
i’m learning web development
In the past few weeks, I had many job offers. All of them were related to web design/web application development. Because my web skills are pretty much non-existent, I had to turn them down.
So, in an effort to become a better developer, I’m learning web.py as a web framework, and MooTools as a JavaScript framework.
Because I don’t want to just read books and tutorials about building “web skills”, I decided to learn them by writing an application: a microblogging application. Things are going somewhat well … I have a textarea that “resembles” the ones used in this kind of application ( somewhat functional ). I still have a long way to go, and I’m pretty excited.
webkit for swt
While checking new links on dzone, I found this. Just imagine how many cool tools you can build with this!
In the previous posts, I’ve written about the techniques one could use to perform web scraping. I feel it’s important that developers know how to detect spiders and how to restrict them.
I think that the Stack Overflow question “How do you stop scripters from slamming your website hundreds of times a second?” compiles the best information related to this topic. You can read the whole thing here.
web scraping techniques
At one time or another , every developer has to extract data from multiple pages . I know most of you guys use regexes . I do too sometimes . I even know someone who used PHP’s explode function ( hello Andrei ! ) and then extracted the results . I will show you some of the techniques used to extract data , along with some of the advantages and disadvantages of each one .
In this tutorial I will show you how to scrape the name of the categories in this blog . They are shown in the picture below :

I am sorry I didn’t capture all the categories in the picture , but I think you got the point . It usually takes me a couple of tries until I can figure out how to extract the data of a website . It would help ( and it would be better for the website that you’re scraping ) if you would write the response to a file ( or serialize it ) and work on the local version . This way , you wouldn’t make unnecessary requests . Here’s what I’m talking about :
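require "rubygems"
require "mechanize"
site = ARGV[0] || (abort "I need a link")
# Mechanize releases before 1.0 exposed this class as WWW::Mechanize
mech = Mechanize.new
mech.get(site)
# write in binary mode so the marshalled data is stored byte for byte
File.open("body_serialized.ser","wb") do |file|
  Marshal.dump(mech.page.body,file)
end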
The script above will request a page and store its content in a file named body_serialized.ser . Until we can figure out what technique we’re going to use to parse the data , we’ll work with the string stored in the body_serialized.ser file .
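The “request once , then work on the local copy” idea can also be wrapped in a small helper . This is only a sketch : cached_body is a hypothetical name that does not come from mechanize , and the block stands in for whatever code performs the real request .

```ruby
# Hypothetical helper (not part of mechanize): return the cached body if the
# cache file already exists, otherwise run the block once and store its result.
def cached_body(path)
  if File.exist?(path)
    # Cache hit: no request is made at all.
    File.open(path, "rb") { |file| Marshal.load(file) }
  else
    # Cache miss: the block is expected to perform the request and return the body.
    body = yield
    File.open(path, "wb") { |file| Marshal.dump(body, file) }
    body
  end
end

# Usage sketch, assuming mech and site are set up as in the script above:
# body = cached_body("body_serialized.ser") { mech.get(site).body }
```

With this in place , running the script again while you experiment reads the file instead of hitting the website .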
- Technique no.1 : Text searching ( no regexes )
This is one of the most used techniques out there . Most of the time , it’s used by people who don’t understand regular expressions . This is one of the techniques that works on trial and error . Here is the first try :
# first try
text = ""
File.open("body_serialized.ser") do |file|
  text = Marshal.load(file)
end
text.each_line do |line|
  line = line.downcase
  if(line.index("title=\"view all posts"))
    contents = line.split(" ")
    interesting = contents[9]
    puts interesting
  end
end
Running the script produces the following output :
<a
nil
http">http</a>
java">java</a>
music">music</a>
ruby">ruby</a>
spider">spider</a>
swing">swing</a>
synchronization">synchronization</a>
threads">threads</a>
videos">videos</a>
It’s a start ! We’ll have to work a bit on this one … here’s the script version that extracts what we want :
# second try
text = ""
File.open("body_serialized.ser") do |file|
  text = Marshal.load(file)
end
text.each_line do |line|
  line = line.downcase
  if(line.index("title=\"view all posts"))
    contents = line.split(" ")
    if(contents.size == 10)
      interesting = contents[9]
      quote_position = interesting.index("\"")
      if(!quote_position.nil?)
        interesting = interesting[0,quote_position]
        puts interesting
      end
    end
  end
end
Running this script produces the following output :
http
java
music
ruby
spider
swing
synchronization
threads
videos
And that was exactly what we needed ! Unlike the first version of this script , this one checks whether splitting a line into tokens returns exactly 10 tokens . The reason I did this is to filter out the categories that appear in posts . For example , if I didn’t have this condition , the script would also try to parse the categories from my post , and it would fail , because the html code for them is a little bit different .
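A variation on the same no-regex idea is to search for positions inside the line instead of counting tokens . The sketch below is not the script from this post : the sample markup and the name categories_by_text_search are assumptions made up for illustration , and the real page may look different .

```ruby
# Assumed sample markup that imitates the category links described above.
sample_html = <<~HTML
  <li><a href="/category/java/" title="View all posts filed under java">java</a></li>
  <li><a href="/category/ruby/" title="View all posts filed under ruby">ruby</a></li>
  <p>No category link on this line</p>
HTML

# Hypothetical helper: for every line that contains the marker, take the text
# between the first ">" after the marker and the following "</a>".
def categories_by_text_search(html, marker = 'title="view all posts')
  names = []
  html.each_line do |line|
    lowered = line.downcase
    marker_at = lowered.index(marker)
    next if marker_at.nil?
    open_end = lowered.index(">", marker_at)       # end of the opening <a ...> tag
    next if open_end.nil?
    close_start = lowered.index("</a>", open_end)  # start of the closing tag
    next if close_start.nil?
    names << line[(open_end + 1)...close_start].strip
  end
  names
end

puts categories_by_text_search(sample_html)
```

It still relies only on String#index and slicing , so it keeps the advantages of this technique , and it also keeps the main disadvantage : it breaks as soon as the markup changes .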
Advantages of this technique :
- You don’t need to understand regexes
- You don’t need any third-party library/framework/tool
- Pretty easy to do if you know a bit of programming
Disadvantages of this technique :
- Searching text in this way takes a lot of tries to get to the right results
- If the site’s html changes its structure , the script is unusable ( you would have to do this all over again )
- Technique no.2 : Text searching ( with regexes )
If you understand regexes , this job will be a lot simpler and faster . This site is one of the best sites related to regular expressions . If you don’t know them , I would recommend you pay it a visit . Here’s the script that extracts the categories using regular expressions :
text = ""
File.open("body_serialized.ser") do |file|
  text = Marshal.load(file)
end
matches = text.scan(/title=\"View all posts.*?>(.*?)<\/a>/i)
matches.each do |match|
  puts match
end
Running this script produces exactly the results we’re after . If you don’t believe me , try it !
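If you want to see what scan hands back without requesting a page , here is a small sketch on a made-up string . The sample markup is an assumption , not the real html of this blog .

```ruby
# Assumed sample markup that imitates two category links.
sample_html = <<~HTML
  <li><a href="/category/music/" title="View all posts filed under music">music</a></li>
  <li><a href="/category/ruby/" title="View all posts filed under ruby">ruby</a></li>
HTML

# Same pattern as in the script above. The lazy .*? stops at the first ">"
# after the title attribute, and the group captures the text up to "</a>".
category_pattern = /title="View all posts.*?>(.*?)<\/a>/i

# scan returns one array per match (one entry per capture group), so flatten
# turns the nested result into a flat list of category names.
categories = sample_html.scan(category_pattern).flatten
puts categories
```

Because the pattern has exactly one group , each match is a one-element array , which is why puts match in the script above prints just the category name .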
Advantages of this technique
- You get to write less code
- Development speed greatly increases ! This script took me about 2 minutes to write ( and it worked the first time ) , while the script for the first technique took about 10 minutes ( and I wrote it using trial and error ) .
- The regex is pretty easy to replace ( you don’t have to modify the whole script to make it work if the site changes its structure )
Disadvantages of this technique
- You would have to know regexes to use this technique ( duh )
- A lot of developers don’t know how to use them or find them difficult to use
- Technique no.3 : XPath
In order to get the most out of this technique , I assume you’re using Firefox . You would have to install the XPath Checker addon . After you have installed the addon , navigate to this blog , right click on one of the categories , and select “View XPath” , like in the following picture :

A window like this will appear :

As you can see in the window , we have the xpath expression for music . Notice that if you manipulate the xpath expression , you will see the “matches” in the window . In this example , we want to find the other categories as well . The xpath expression the window shows is this : id('categories-352220371')/ul/li[3]/a , which loosely translates to :
- get me the link that is a child of the third li , which is a child of an unordered list , which is a child of something with the id of categories-352220371
We can see that the li part looks like an array of some sort . If we check with the page , we can see that the music category is indeed the third category . So , what do you think will happen if we replace li[3] with li ? You guessed right , we get a list of all the categories . I think you will agree with me when I say that this technique is very simple and effective . Here is the code that extracts the categories using xpath :
require "rubygems"
require "nokogiri"
text = ""
File.open("body_serialized.ser") do |file|
  text = Marshal.load(file)
end
doc = Nokogiri::HTML(text)
doc.xpath("id('categories-352220371')/ul/li/a").each do |category|
  puts category.text.chomp
end
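The difference between li[3] and li can also be tried on a tiny document , without the browser addon . This sketch swaps in REXML ( it ships with Ruby ) instead of nokogiri , so it only works on well-formed markup . The sample document and the id categories-demo are assumptions , and the id( ) function is replaced by an @id attribute test , which should select the same element in this sample .

```ruby
require "rexml/document"

# Assumed, well-formed stand-in for the categories list on the page.
sample_xml = <<~XML
  <div id="categories-demo">
    <ul>
      <li><a href="/category/http/">http</a></li>
      <li><a href="/category/java/">java</a></li>
      <li><a href="/category/music/">music</a></li>
    </ul>
  </div>
XML

doc = REXML::Document.new(sample_xml)

# li[3] selects only the third list item ...
third = REXML::XPath.match(doc, "//div[@id='categories-demo']/ul/li[3]/a").map(&:text)
# ... while plain li selects every list item under the ul.
all = REXML::XPath.match(doc, "//div[@id='categories-demo']/ul/li/a").map(&:text)

puts third.inspect
puts all.inspect
```

For real pages , which are rarely well formed , a forgiving parser like the nokogiri one used above is the safer choice .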
Advantages of this technique
- Development speed is greatly enhanced ! It took me 2 minutes to write the script once I found the xpath expression .
- The xpath expression is very easy to replace ! ( a matter of seconds )
- Both firefox and the xpath checker addon are free to use
Disadvantages of this technique
- The page should be well formed ( you should always try to use a parser that can work with malformed documents , e.g. BeautifulSoup for python , or clean the page’s source with something like tidy , which has bindings for a lot of programming languages )
- Pretty hard to use this technique if you don’t know xpath , or can’t use the xpath checker addon
- Technique no.4 : CSS selectors
You could scrape a web page “with style” . This doesn’t mean you should wear an Armani suit while writing your code . It means you could use CSS selectors to find the information you need . Since I’m not a web developer , and I know extremely little CSS , I won’t show you how to use this technique ( because I don’t know how to use it either ) . It’s sufficient to know that it exists . You can find more information about it here .
Advantages of this technique
- if you know how to use CSS , I think your development speed would increase
- I think scripts using this technique won’t be modified as often as the others . From what I know , a page’s style doesn’t change that often.
Disadvantages of this technique
- You must know CSS
Perhaps other web scraping techniques exist , perhaps they are better than the ones I showed here . I don’t know . These are the ones I use ( I don’t really use the first one now ) , and they work for me . I hope you enjoyed this tutorial !
See you around !
Geo