

don’t be afraid to try other frameworks

I really love open source. If open source didn't exist, I think my productivity would drop a lot. The great thing is that you always have alternatives: if you don't want to code in a specific language, there's a good chance you'll find something similar in another one. For example, mechanize is available for Perl, Ruby, and Python.

Because each library is built on top of different underlying libraries, things won't always work out the way you want them to.

For example, today I was trying to scrape a page using XPath. I chose mechanize and nokogiri, and I obtained the XPath expression with XPath Checker. However, I wasn't getting back what I needed; in fact, I wasn't getting back any results at all.

The first thing I tried was switching from nokogiri to REXML. The problem didn't stop there: REXML raised an exception, from which I deduced that the page wasn't well formed. Usually, in situations like this, you'd use tidy to clean the page up. Surprisingly, that didn't fix it either; REXML still raised the same exception.
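You can reproduce the kind of failure I was seeing with a few lines of REXML: it's a strict XML parser, so it rejects anything that isn't well formed (the fragment below is made up for illustration, not the original page):

```ruby
require 'rexml/document'

# typical real-world HTML: the <p> tag is never closed
broken = "<html><body><p>unclosed paragraph</body></html>"

begin
  REXML::Document.new(broken)
  puts "parsed ok"
rescue REXML::ParseException
  puts "not well formed"
end
```

This prints "not well formed" — and if tidy's cleaned-up output still trips the parser, you end up stuck exactly where I was.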

I usually use mechanize for all my scraping needs, but using something different isn't outside my comfort zone. So I thought I'd try something Java-based. Because I didn't want to write a couple of hundred lines of code for some simple scraping, I decided to go with WebDriver. This is WebDriver's description:

WebDriver has a simple API designed to be easy to work with and can drive both real browsers, for testing javascript heavy applications, and a pure ‘in memory’ solution for faster testing of simpler applications.

With WebDriver, I was able to control Firefox, and my XPath worked. Just like that! From that point on, everything worked just fine.

Even though I could have dropped XPath and still used mechanize/nokogiri for scraping, switching to WebDriver kept the task very simple. Here’s some WebDriver sample code:

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class WebDriverTest {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // use the default firefox profile
        WebDriver driver = new FirefoxDriver("default");
        // insert your XPath expression here
        String xpath = "xpath_expression";
        // load the page
        driver.get("http://www.example.com"); // replace with the page to scrape
        // find elements of the page by XPath
        List<WebElement> elements = driver.findElements(By.xpath(xpath));
        // if elements were found
        if (elements != null && !elements.isEmpty()) {
            // iterate over each element
            for (WebElement element : elements) {
                // print each element's text
                System.out.println(element.getText());
            }
        } else {
            System.out.println("nothing found");
        }
    }
}
As you can see, the code is very readable. The only things I needed to add were a loop around the whole XPath extraction and a click on a button named "Next" to get to the next page.

This doesn't apply only to web scraping, but to other programming-related activities as well. You should always explore other frameworks, because some of them will increase your productivity more than others. Knowing the "competition" can and most likely will help you; even if not now, in the long run it will be worth the time invested.

P.S.: is it just me, or is this blog's title becoming ssssssscraping? 🙂
