Archive for the 'scraping' Category

17 Mar 09

how to submit a form programmatically

I've received some comments asking how to submit a form from your own applications, so I've decided to write an article about it.

There are a number of ways to accomplish this task:

  • you can use the Selenium IDE Firefox addon to record a session. It will generate code for everything you do inside your browser: clicking a button, filling out a certain field. This is an easy solution if you don't know how to program.
  • you can use the Firefox addon Tamper Data to find out the names of all the fields your browser is sending, and their values as well. A downside is that you need to submit the form at least once.
  • you can use the Firefox addon Firebug to find the names of the fields in a form. A downside is that you're very likely to miss some of them, because they are hidden. This is why I would recommend using Tamper Data together with Firebug.

I will illustrate this process with screenshots and some code:

Here is what we will be submitting:
[screenshot: the form]
And this is the HTML for it:


<html>
<form action="whatever.php" method="POST">
	<p>	
	<label for="user">User</label>
	<input type="text" name="user"/>
	</p>
	<p>
	<label for="pwd">Pass</label>
	<input type="password" name="pwd"/>
	</p>
	<input type="submit" value="login"/>
</form>
</html>


( please, don't even bother telling me that this HTML code doesn't respect the standards. I don't care. This is for learning purposes only )

It's a simple form made of three fields: username, password and the submit button. Open your favorite text editor and paste it in. Save the buffer to a file with a .html extension, then open it in your browser.

I hope you installed Tamper Data and Firebug, because now we'll make use of them. We'll start with Firebug. If you've installed it, a bug-like icon will appear in the lower right corner of the browser. If it's colored gray, it means it's disabled, and you have to click it and enable all of its features. If you've succeeded in doing that, the icon should now be orange with black stripes.

Right click the user field. The contextual menu should have the option “Inspect Element”, like in the following screenshot:
[screenshot: the contextual menu]
Click it. You should now see something resembling this picture:
[screenshot: Firebug inspecting the field]
Notice that the field's name is "user". If you do the same for the password field, you'll see that its name is "pwd". In this example, this is redundant, because we already know the names of the fields. However, in the real world, you will not, and you should follow the steps shown here. Here is the code we have so far:


require "rubygems"
require "mechanize"

mech = WWW::Mechanize.new
# i'm loading this file locally
# in real-life you would provide the url of the page containing the form you want to submit
mech.get("file:///test_files/form_test.html")
# obtain the form object
# because this page contains only one form, it's obvious we request the first one
# if the page contained more than one form, you would have iterated over the forms
# and selected the one containing the fields you needed
form = mech.page.forms.first
# and now we complete the fields
# username first
# the order in which you complete this form is not important
form.user = "geo"
# and now the password
form.pwd = "mypassword"
# submit the form
form.submit
# do whatever you want to with the returned page
puts mech.page.body

If you run this code you'll notice that it works ( that is, if you pointed the action parameter at something real. If you haven't, you'll get a 4xx error code, which still means the submission worked: the error appears simply because the script that was supposed to handle the form wasn't found ).

Usually, before submitting a form, you should use Tamper Data to make sure you're sending all the parameters. So, open the website in Firefox, fill out all the fields in the form, go to the "Tools" menu of your browser and click "Tamper Data", like in the following screenshot:
[screenshot: the Tools menu with the Tamper Data entry]
If you did this, a new window will appear on your desktop:
[screenshot: the Tamper Data window]
Click "Start Tamper", and then submit your form ( click on login/submit/search/whatever ). After you've done this, something like this will appear:
[screenshot: the Tamper Data prompt]
Click Tamper. This is what you will see next:
[screenshot: the request parameters]

In this example, this is exactly what we expected to see: only the user and pwd fields are sent. However, in the real world, you'll often find that more parameters are needed. Use Tamper Data before you start writing your code.
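
If Tamper Data does reveal a parameter you weren't setting, you can usually fill it in on the mechanize form just like the visible fields. A minimal sketch, assuming a hidden field called "token" ( the name and URL here are made up, use whatever Tamper Data actually shows you ):

require "rubygems"
require "mechanize"

mech = WWW::Mechanize.new
# hypothetical URL, replace it with the page containing your form
mech.get("http://example.com/login")
form = mech.page.forms.first
form.user = "geo"
form.pwd = "mypassword"
# "token" is a made-up hidden field, set whatever extra parameters Tamper Data showed
form["token"] = "value copied from Tamper Data"
form.submit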

I like using mechanize for this sort of stuff, because it makes these tasks really easy to handle. You can apply what you've learned here to any "mechanize-like" framework.

kthxbai

04 Mar 09

faster way to find a cookie

In my previous post, https://ssscripting.wordpress.com/2009/02/24/how-to-scrape-data-from-sites-you-cant-log-into/, I showed how you can use a cookie from the web browser to speed up the scraping process. However, because a website can set more than one cookie, that process is a bit error prone ( you have to concatenate all the values into a string ). While testing something with netcat, I found a faster way that is virtually error-free.

All you have to do is to start netcat and set it to listen on a specific port. Here’s how you do that:

nc -l -p 9000

Next, you have to configure your browser's proxy to "localhost", port 9000 ( or whatever port you specified ). Here is how you do it in Firefox. Go to Options/Preferences, and then get to this screen:
[screenshot: the Firefox network settings]
From here, click on Settings, and fill in the proxy details. After you've done this, visit the site whose cookie you want to find. Look in the terminal/console you opened nc in and you will see the HTTP request. Look for the Cookie header and copy it. From here on, you can follow the steps in the other article.
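
To make it concrete, here is a minimal sketch of reusing the copied header with plain Net::HTTP ( the URL and cookie value below are made up, paste in your own ):

require "net/http"
require "uri"

# paste here exactly what you copied from the Cookie header in the netcat output
cookie = "PHPSESSID=abc123; logged_in=1"

uri = URI.parse("http://the-site-you-are-scraping.com/members")
response = Net::HTTP.start(uri.host, uri.port) do |http|
  http.get(uri.request_uri, "Cookie" => cookie)
end
puts response.body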

24 Feb 09

how to detect spiders/web crawlers

In the previous posts, I've written about techniques one can use to perform web scraping. I feel it's important that developers also know how to detect spiders and how to restrict them.

I think the Stack Overflow question "How do you stop scripters from slamming your website hundreds of times a second?" compiles the best information on this topic. You can read the whole thing here.
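
One of the simplest signals discussed there is request rate: a normal visitor doesn't request dozens of pages per second from the same IP. Here's a toy sketch of the idea ( the numbers are made up, and a real site would do this in the web server or middleware, against a shared store ):

WINDOW   = 10  # seconds
MAX_HITS = 20  # requests allowed per IP inside the window

hits = Hash.new { |h, ip| h[ip] = [] }

def suspicious?(hits, ip)
  now = Time.now
  hits[ip] << now
  # keep only the requests made inside the window
  hits[ip].reject! { |t| now - t > WINDOW }
  hits[ip].size > MAX_HITS
end

# for every incoming request you would do something like:
# puts "possible crawler: #{ip}" if suspicious?(hits, ip)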

19 Feb 09

don’t be afraid to try other frameworks

I really love open source. If open source didn't exist, I think my productivity would drop a lot. The great thing is that you always have alternatives. If you don't want to be coding in a specific language, there's a big chance you'll find something similar in another language. For example, mechanize is available for Perl, Ruby and Python.

Because each library is built on top of different other libraries, things won't always work out the way you want them to.

For example, today I was trying to scrape a page using XPath. I chose mechanize and nokogiri, and I obtained the XPath expression with XPath Checker. However, I wasn't getting back the stuff I needed. In fact, I wasn't getting back any results.

The first thing I tried was to switch from nokogiri to REXML. That didn't solve the problem either: REXML raised an exception, from which I deduced that the page wasn't well formed. Usually, in situations like this, you would want to use tidy to clean it up. Surprisingly, that didn't fix it; REXML still raised the same exception.

I usually use mechanize for all my scraping needs, but using something different isn't outside my comfort zone. So, I thought I would try something Java based. Because I didn't want to write a couple of hundred lines of code for some simple scraping, I decided to go with WebDriver. This is WebDriver's description:

WebDriver has a simple API designed to be easy to work with and can drive both real browsers, for testing javascript heavy applications, and a pure ‘in memory’ solution for faster testing of simpler applications.

With WebDriver, I was able to control Firefox, and my XPath worked! Just like that! From this point on, everything worked just fine.

Even though I could have dropped XPath and still used mechanize/nokogiri for scraping, switching to WebDriver kept the task very simple. Here’s some WebDriver sample code:

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class WebDriverTest {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // start firefox with the default profile
        WebDriver driver = new FirefoxDriver();

        // insert your XPath expression here
        String xpath = "xpath_expression";

        // load the page
        driver.get("http://some_site");

        // find elements of the page by XPath
        List<WebElement> elements = driver.findElements(By.xpath(xpath));

        // if elements were found
        if (!elements.isEmpty()) {

            // iterate over each element
            for (WebElement element : elements) {
                // print text
                System.out.println(element.getText());
            }

        }
        else {
            System.out.println("nothing found");
        }
    }
}

As you can see, the code is very readable. The only thing I needed to add was to put the whole XPath extraction in a loop and click a link named "Next" to get to the next page.
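
If you stayed with mechanize instead of WebDriver, the same loop-over-pages idea might look roughly like this ( the "Next" link text is an assumption, match it to the site you're scraping ):

require "rubygems"
require "mechanize"

mech = WWW::Mechanize.new
mech.get("http://some_site")

loop do
  # extract whatever you need from the current page here
  puts mech.page.body

  # look for a link whose text contains "next", stop when there isn't one
  next_link = mech.page.links.find { |link| link.text =~ /next/i }
  break if next_link.nil?
  next_link.click
end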

This doesn't apply only to web scraping, but to other programming related activities as well. You should always explore other frameworks, because some of them will increase your productivity more than others. Knowing the "competition" can and most likely will help you; even if not right now, in the long run it will be worth the time invested.

P.S: is it just me, or this blog’s title is becoming ssssssscraping? 🙂

15 Feb 09

web scraping techniques


At one time or another, every developer has to extract data from multiple pages. I know most of you guys use regexes. I do too sometimes. I even know someone who used PHP's explode function ( hello Andrei! ) and then extracted the results. I will show you some of the techniques used to extract data, along with some of the advantages and disadvantages of each one.
In this tutorial I will show you how to scrape the names of the categories in this blog. They are shown in the picture below:

[screenshot: the categories widget]

I am sorry I didn't capture all the categories in the picture, but I think you get the point. It usually takes me a couple of tries until I can figure out how to extract the data from a website. It helps ( and it's better for the website you're scraping ) if you write the response to a file ( or serialize it ) and work on the local version. This way, you don't make unnecessary requests. Here's what I'm talking about:


require "rubygems"
require "mechanize"

site = ARGV[0] || (abort "I need a link")
mech = WWW::Mechanize.new
mech.get(site)

File.open("body_serialized.ser","w") do |file|
   Marshal.dump(mech.page.body,file)
end

The script above will request a page and store its content in a file named body_serialized.ser. Until we figure out which technique we're going to use to parse the data, we'll work with the string stored in the body_serialized.ser file.

  • Technique no.1: Text searching ( no regexes )

    This is one of the most used techniques out there. Most of the time, it's used by people who don't understand regular expressions. This technique works by trial and error. Here is the first try:

    
    # first try
    text = ""
    
    File.open("body_serialized.ser") do |file|
    	text = Marshal.load(file)
    end
    
    text.each_line do |line|
    	line = line.downcase
    	if(line.index("title=\"view all posts"))
    		contents = line.split(" ")
    		interesting = contents[9]
    		puts interesting
    	end	   
    end
    

    Running the script produces the following output:

    
    
    <a
    nil
    http">http</a>
    java">java</a>
    music">music</a>
    ruby">ruby</a>
    spider">spider</a>
    swing">swing</a>
    synchronization">synchronization</a>
    threads">threads</a>
    videos">videos</a>
    

    It's a start! We'll have to work a bit on this one… here's the version of the script that extracts what we want:

    
    # first try
    text = ""
    
    File.open("body_serialized.ser") do |file|
    	text = Marshal.load(file)
    end
    
    text.each_line do |line|
    	line = line.downcase
    	if(line.index("title=\"view all posts"))
    		contents = line.split(" ")
    		if(contents.size == 10)
    			interesting = contents[9]
    			quote_position = interesting.index("\"")
    			if(!quote_position.nil?)
    				interesting = interesting[0,quote_position]
    				puts interesting
    			end
    		end
    	end	   
    end
    
    

    Running this script produces the following output:

    
    http
    java
    music
    ruby
    spider
    swing
    synchronization
    threads
    videos
    

    And that is exactly what we needed! Unlike the first version of this script, this one checks whether, after splitting a line into tokens, we get 10 tokens back. The reason I did this is to filter out the categories that appear in posts. For example, if I didn't have this condition, the script would also try to parse the categories from my post, and it would fail, because the HTML code for them is a little bit different.

    Advantages of this technique :

    • You don’t need to understand regexes
    • You don’t need any third-party library/framework/tool
    • Pretty easy to do if you know a bit of programming

    Disadvantages of this technique :

    • Searching text this way takes a lot of tries to get to the right results
    • If the site's HTML structure changes, the script becomes unusable ( you would have to do this all over again )
  • Technique no.2: Text searching ( with regexes )

    If you understand regexes, this job will be a lot simpler and faster. This site is one of the best sites related to regular expressions. If you don't know them, I recommend you pay it a visit. Here's the script that extracts the categories using regular expressions:

    
    text = ""
    
    File.open("body_serialized.ser") do |file|
    	text = Marshal.load(file)
    end
    
    matches = text.scan(/title=\"View all posts.*?>(.*?)<\/a>/i)
    matches.each do |match|
    	puts match
    end
    

    Running this script produces exactly the results we're after. If you don't believe me, try it!

    Advantages of this technique

    • You get to write less code
    • Development speed greatly increases! This script took me about 2 minutes to write ( and it worked the first time ), while the script for the first technique took about 10 minutes ( and I wrote it using trial and error ).
    • The regex is pretty easy to replace ( you don't have to modify the whole script to make it work if the site changes its structure )

    Disadvantages of this technique

    • You have to know regexes to use this technique ( duh 🙂 )
    • A lot of developers don't know how to use them, or find them difficult to use
  • Technique no.3: XPath

    In order to get the most out of this technique, I assume you're using Firefox. You have to install the XPath Checker addon. After you've installed the addon, navigate to this blog, right click on one of the categories, and select "View XPath", like in the following picture:

    [screenshot: the contextual menu with the View XPath entry]

    A window like this will appear:
    [screenshot: the XPath Checker window]

    As you can see in the window, we have the XPath expression for music. Notice that if you manipulate the XPath expression, you will see the "matches" in the window. In this example, we want to find the other categories as well. The XPath expression the window shows is this: id('categories-352220371')/ul/li[3]/a, which loosely translates to:

    • get me the link which is a child of the third li, which is a child of an unordered list, which is a child of something with the id categories-352220371

    We can see that the li part looks like an array of some sort. If we check the page, we can see that the music category is indeed the third category. So, what do you think will happen if we replace li[3] with li? You guessed right: we get a list of all the categories. I think you will agree with me when I say that this technique is very simple and effective. Here is the code that extracts the categories using XPath:

    
    require "rubygems"
    require "nokogiri"
    
    text = ""
    
    File.open("body_serialized.ser") do |file|
    	text = Marshal.load(file)
    end
    
    doc = Nokogiri::HTML(text)
    doc.xpath("id('categories-352220371')/ul/li/a").each do |category|
    	puts category.text.chomp
    end
    

    Advantages of this technique

    • Development speed is greatly enhanced! It took me 2 minutes to write the script once I found the XPath expression.
    • The XPath expression is very easy to replace! ( a matter of seconds )
    • Both Firefox and the XPath Checker addon are free to use

    Disadvantages of this technique

  • Technique no.4: scraping using CSS ( kind of )

    You could scrape a web page "with style". This doesn't mean you should wear an Armani suit while writing your code 🙂 . It means you could use CSS selectors to find the information you need. Since I'm not a web developer, and I know extremely little CSS, I won't walk you through this technique ( because I don't really know how to use it either ); it's sufficient to know that it exists ( there is a small sketch after this list, though ). You can find more information about it here.
    Advantages of this technique

    • if you know how to use CSS, I think your development speed would increase
    • I think scripts using this technique won't need to be modified as often as the others. From what I know, a page's style doesn't change that often.

    Disadvantages of this technique

    • You must know CSS
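
For the curious, here is a rough, minimal sketch of what the CSS selector version might look like with nokogiri, assuming the same categories widget id as in the XPath example ( untested, since this isn't a technique I use ):

require "rubygems"
require "nokogiri"

text = ""

File.open("body_serialized.ser") do |file|
  text = Marshal.load(file)
end

doc = Nokogiri::HTML(text)
# the selector is an assumption, the id is taken from the XPath example above
doc.css("#categories-352220371 ul li a").each do |category|
  puts category.text.chomp
end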

Perhaps other web scraping techniques exist, and perhaps they are better than the ones I showed here. I don't know. These are the ones I use ( I don't really use the first one anymore ), and they work for me. I hope you enjoyed this tutorial!

See you around !

Geo



