19 Feb 09

don’t be afraid to try other frameworks


I really love open source. If open source did not exist, I think my productivity would drop a lot. The great thing is that you always have alternatives. If you don’t want to code in a specific language, there’s a good chance you’ll find something similar for another language. For example, mechanize is available for Perl, Ruby, and Python.

Because each library is built on a different set of underlying libraries, things won’t always work out the way you want them to.

For example, today I was trying to scrape a page using XPath. I chose mechanize and nokogiri, and I obtained the XPath expression with XPath checker. However, I wasn’t getting back the stuff I needed. In fact, I wasn’t getting back any results.

The first thing I tried was switching from nokogiri to REXML, but the problem didn’t stop there. REXML raised an exception, from which I deduced that the page wasn’t well formed. Usually, in situations like this, you would use tidy to clean up the markup. Surprisingly, this didn’t fix it either: REXML still raised the same exception.
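To see what “not well formed” means for a strict parser, here is a minimal sketch using the XML parser and XPath engine that ship with the JDK. It is only a stand-in for REXML, not the Ruby setup described above. The class name StrictXPathDemo, the method countMatches, and the two markup strings are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StrictXPathDemo {

    /**
     * Returns the number of nodes matching the XPath expression,
     * or -1 if the markup is rejected as not well formed.
     */
    public static int countMatches(String markup, String xpath) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (SAXException e) {
            // the strict parser gave up: the markup is not well formed
            return -1;
        } catch (Exception e) {
            // configuration, I/O or XPath syntax problems are not expected here
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // hypothetical markup: every <p> is closed
        String wellFormed = "<html><body><p>one</p><p>two</p></body></html>";
        // hypothetical markup: the <p> tags are never closed
        String sloppy = "<html><body><p>one<p>two</body></html>";

        System.out.println(countMatches(wellFormed, "//p")); // prints 2
        System.out.println(countMatches(sloppy, "//p"));     // prints -1
    }
}
```

A browser happily renders the second string, which is why an XPath expression built inside Firefox can work there and still fail in a library that insists on well-formed input.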

I usually use mechanize for all my scraping needs, but using something different isn’t out of my comfort zone. So, I thought I would try something Java-based. Because I didn’t want to write a couple of hundred lines of code for some simple scraping, I decided to go with WebDriver. This is WebDriver’s description:

WebDriver has a simple API designed to be easy to work with and can drive both real browsers, for testing javascript heavy applications, and a pure ‘in memory’ solution for faster testing of simpler applications.

With WebDriver, I was able to control Firefox, and my XPath worked! Just like that! From this point on, everything worked just fine.

Even though I could have dropped XPath and still used mechanize/nokogiri for scraping, switching to WebDriver kept the task very simple. Here’s some WebDriver sample code:

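// imports assume the org.openqa.selenium package layout
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class WebDriverTest {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // use the default firefox profile (this profile-name constructor
        // belongs to the WebDriver release used here; newer Selenium
        // releases configure profiles differently)
        WebDriver driver = new FirefoxDriver("default");

        // insert your XPath expression here
        String xpath = "xpath_expression";

        // load the page
        driver.get("http://some_site");

        // find elements of the page by XPath
        List<WebElement> elements = driver.findElements(By.xpath(xpath));

        // findElements returns an empty list, not null, when nothing matches
        if (!elements.isEmpty()) {

            // iterate over each element
            for (WebElement element : elements) {
                // print text
                System.out.println(element.getText());
            }

        } else {
            System.out.println("nothing found");
        }
    }
}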
As you can see, the code is very readable. The only things I needed to add were a loop around the whole XPath extraction and a click on a button named “Next” to get to the next page.
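The shape of that loop can be sketched without WebDriver, so it runs on its own. This is an illustration under assumptions, not the code from the actual scraper: in a real run the first callback would wrap driver.findElements(By.xpath(xpath)) and collect each element’s text, and the second would click the “Next” button and report whether one was found. The names PagedScrape, scrapeAllPages, extractCurrentPage, and goToNextPage are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

public class PagedScrape {

    /**
     * Collects results page by page.
     *
     * @param extractCurrentPage returns the texts found on the page currently loaded
     * @param goToNextPage       moves to the next page; returns false when there is none
     */
    public static List<String> scrapeAllPages(Supplier<List<String>> extractCurrentPage,
                                              BooleanSupplier goToNextPage) {
        List<String> results = new ArrayList<>();
        do {
            // scrape the page we are on before trying to move forward
            results.addAll(extractCurrentPage.get());
        } while (goToNextPage.getAsBoolean());
        return results;
    }

    public static void main(String[] args) {
        // hypothetical stand-in for a site with three pages of results
        List<List<String>> pages = List.of(
                List.of("a", "b"),
                List.of("c"),
                List.of("d", "e"));
        int[] current = {0};

        List<String> all = scrapeAllPages(
                () -> pages.get(current[0]),
                () -> {
                    // "click Next": advance only if another page exists
                    if (current[0] + 1 < pages.size()) {
                        current[0]++;
                        return true;
                    }
                    return false;
                });

        System.out.println(all); // prints [a, b, c, d, e]
    }
}
```

The do/while form matters: the first page is scraped before any attempt to click “Next”, and the loop ends as soon as no further page is available.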

This doesn’t apply only to web scraping, but to other programming-related activities as well. You should always explore other frameworks, because some of them will increase your productivity more than others. Knowing the “competition” can, and most likely will, help you. Even if it doesn’t pay off right away, in the long run it will be worth the time you invest in it.

P.S.: is it just me, or is this blog’s title becoming ssssssscraping? :)
