As a disclaimer: you are responsible for how you use the information I’m showing you here; it is presented only as a learning experience. There are many sites for which writing code that logs into them is very difficult, and this is why we will use the browser’s cookies. This way, you can focus only on the scraping task.
IMPORTANT: I am using Firefox for this tutorial.
You should know that websites identify you through cookies. This is Wikipedia’s definition:
HTTP cookies, more commonly referred to as Web cookies, tracking cookies or just cookies, are parcels of text sent by a server to a Web client (usually a browser) and then sent back unchanged by the client each time it accesses that server. HTTP cookies are used for authenticating, session tracking (state maintenance), and maintaining specific information about users, such as site preferences or the contents of their electronic shopping carts. The term “cookie” is derived from “magic cookie,” a well-known concept in UNIX computing which inspired both the idea and the name of HTTP cookies. Tracking cookies track your web browsing habits. They can collect information about pages and advertisements you have seen or any other activity during browsing. Different websites can share tracking cookies, and each website with the same tracking cookie can read the information and write new information into it.
You can read more about them here.
Here is what you have to do:
- open the site that you want to scrape
- log in
- install the Firefox add-on Cookie Monster
- restart Firefox
- in Firefox’s bottom right corner you should now have an icon with the letters CM (big blue C). Left-click it.

- click View cookies, and select the first option, “Show cookies for [whatever site you're visiting now]”
- you should now see all the cookies for the site
- in order to appear as logged in, the Cookie header of your request should contain the cookies that Cookie Monster is showing. So, all you have to do is build a string to which you append them in the following format: cookie_name=cookie_value; other_cookie_name=other_cookie_value. As you can see, the cookies are separated by a semicolon (the standard form also puts a space after it). I don’t think the order is important; I’m adding them in the order Cookie Monster shows them.
- by adding that Cookie header to each request you make, you will appear as logged in, and you can perform the scraping you want. Here is some Java code that demonstrates this:

    import java.io.*;
    import java.net.*;

    public class CookieChecker {
        public static void main(String[] args) throws Exception {
            URLConnection con = new URL("http://some_site").openConnection();
            con.setAllowUserInteraction(false);
            // Send the cookies copied from the browser along with the request.
            con.addRequestProperty("Cookie", "cookie_name1=cookie_val1; cookie_name2=cookie_val2");

            // Print the response body; try-with-resources closes the reader.
            try (BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
                String line;
                while ((line = br.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }

- you should know that this isn’t the only way to accomplish this task. You could also “automate” Firefox from your code. Check out this post I wrote to see how to do it using WebDriver.
- the next step is to choose a scraping technique and build your scraper around it. Check out web scraping techniques for more information.
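If you have more than a couple of cookies, typing the header string by hand gets error-prone. Here is a minimal sketch of how the string described in the steps above could be assembled from name/value pairs. The class name `CookieHeader`, the `build` helper and the cookie names and values are placeholders I made up for illustration, not anything Cookie Monster gives you.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class CookieHeader {

    // Joins name/value pairs into a single Cookie header value,
    // in the form "name1=value1; name2=value2".
    static String build(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            joiner.add(cookie.getKey() + "=" + cookie.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the cookies come out
        // in the order they were copied from the browser.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("cookie_name1", "cookie_val1"); // placeholder values
        cookies.put("cookie_name2", "cookie_val2");

        // Prints: cookie_name1=cookie_val1; cookie_name2=cookie_val2
        System.out.println(build(cookies));
    }
}
```

The resulting string is what you would pass as the value of the "Cookie" request property.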
Hopefully, this tutorial will help you in your scraping tasks.
See you later!
UPDATE: it seems further explanations are needed. On a usual web scraping task I follow these steps (and I’m sure most of you do the same thing):
- navigate to the login page
- inspect the fields with Firebug, in order to find out the name of the form they belong to, as well as their names/IDs
- log in and look for signs that the login was successful (for example, something on the page that says “welcome back,user”)
- start writing a mechanize/selenium/(insert your favorite framework here) script
- write a regex that will check for the welcome message
- if everything was successful, go ahead with the other part
By using the cookie directly, you go from step 1 to step 6 directly. You pass GO, you collect $200.
Personally, I’m using this when I need something done fast (you could consider this prototyping). Best practice would be to log in normally to a website, but I think it’s good to know you can do things even faster. It’s true that you depend on the browser, and that each time you want to use the code you have to make sure the cookie is still valid, but, as I said before, this is just for quick & dirty development.
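Since the cookie can expire, it may be worth checking that you still appear as logged in before scraping. A rough sketch of that check, reusing the welcome-message idea from the steps above: the `LoginCheck` class, the `looksLoggedIn` helper and the exact pattern are assumptions of mine, so adjust the marker to whatever your target site shows to logged-in users.

```java
import java.util.regex.Pattern;

class LoginCheck {

    // Hypothetical marker: matches "welcome back,user" or "Welcome back, user".
    static final Pattern WELCOME =
            Pattern.compile("welcome back,\\s*\\w+", Pattern.CASE_INSENSITIVE);

    // Returns true if the page body contains the welcome message.
    static boolean looksLoggedIn(String html) {
        return WELCOME.matcher(html).find();
    }

    public static void main(String[] args) {
        // Sample page bodies standing in for real responses.
        String loggedIn = "<p>Welcome back, user</p>";
        String loggedOut = "<form action=\"/login\">Please log in</form>";

        System.out.println(looksLoggedIn(loggedIn));  // prints true
        System.out.println(looksLoggedIn(loggedOut)); // prints false
    }
}
```

In practice you would feed it the response body read from the connection; if it returns false, grab a fresh cookie from the browser.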
Some of you asked, “Why Java? Use mechanize; mechanize is uber1337.” I know mechanize is cool, I’m a big fan myself. I started using mechanize way back, when there was only the Perl version. I like the way Ruby’s mechanize evolved, and I personally consider it to be the best testing framework around. So, on topic, I used Java because I had NetBeans open at the time I wrote this article. Didn’t expect this answer, did you?
I wrote in a comment that when you work with a team of developers, not all the developers will have the same skills. I find it easier to show a developer a snippet of code like the one I pasted above than to show them how to go from step 1 to step 6. The thing is, most of the time there is a deadline, the boss keeps pressing you, and you can’t really spare 1-2 hours to guide someone through this process.
kthxbai