Creating a PHP Screen Scraper

So, I guess when most people get a 3-week holiday break from work, the first thing they do is try to get away from the laptop. Where is the fun in that, I ask you?

This x-mas I took a small trip around the world of PHP and took the opportunity to write some enhancements for my better half's online retail website. PHP is very C-like in its syntax, but it is one of the easier languages to find your way around if you're not coding on a day-to-day basis.

In my 3 weeks, I essentially wrote a plug-in to the aforementioned website which automatically logs in to other websites and intelligently processes the pages of tabular data behind the authenticated curtain. It then takes this tabular data and squirts it out into an XML file, which is then used for some reporting.

If you are looking to understand a bit about how screen-scraping works, or how to mimic interactive HTTP requests to a web server, this blog may be useful for you.

OK, let’s crack on.

Just like the A-Team trapped in some garage in just about every episode I can remember, we need tools. I never did understand why the bad guys always stuck them somewhere they could practically build an F-16 fighter plane out of what was there, but I suppose that was part of the magic.

What you will need in your toolbox:

  • A LAMP, WAMP or XAMPP (e.g. www.apachefriends.org) web development environment.
  • The following PHP extensions enabled: cURL and HTMLTidy.
  • A target website.
  • An account at the target website that you can log in to in order to be served up your tabular data.
  • A copy of a protocol analyser or other similar packet sniffer (I’m using Wireshark – www.wireshark.org).
  • Peace, quiet, patience and determination.

Our process is going to run a little bit like this:

  1. Learn the behaviour of the target website.
  2. Use cURL to mimic the client side behaviour of the site.
  3. Sanitize the tabular data into a nice HTML table.
  4. Access the HTML as if it was an XML object.

1: First Up: Learning about the target

Websites are simple beasts, and so is the HTTP protocol which is used to facilitate communication between web-server and web-browser.  When you click a button on a web page, this is usually associated with some kind of request. You are effectively making the web-browser ask the web-server to do something. The website receives this request, decides what to do with it and then sends a response back. The response usually comes in the form of another web-page.

The question is: how do we actually know what the browser and the website are saying to each other behind the scenes? The great thing about HTTP is that the requests are text based, so if you know what you are looking for, it’s relatively easy to understand the communication between the two. This is where our first tool, Wireshark, our protocol analyser, comes in.

Wireshark’s pretty easy to install (click next, click next, etc.) and you should do this on the same machine as your web-browser.  Wireshark is going to capture the network data from our network card as it is generated by the browser.  The intricacies of Wireshark are outside the scope of this blog. In short, you want to select your network interface and start the capture.

Seconds after clicking start capture, you will be redirected to the capture screen and will most likely start getting inundated with masses of what we call “network crap”.

To ensure we only see what we are looking for, type “http” into the filter box next to “Filter:”. This will filter out anything other than HTTP, so we are good to go.

Open up a browser of your choice (IE, Firefox, Chrome, whatever) and head over to your target site.  You should start to see HTTP packets bouncing back and forth. Drill down into some of these and see if you can see what’s happening down there.  As I said earlier, HTTP is pretty simple so you should get a picture of what requests and responses are sent and received with each action you perform.

You now need to start to build the steps that your HTTP bot will go through to get to the data you want to extract from the site. Ask yourself these questions: would you go to the home page of the site on any visit? Is the log-in form there, or do you need to go to another page to log in? What happens when you log in? Are you redirected to another page, like a home page? Is that where your data is, or do you have to perform a search or navigate some links to get to it?

Build up a list of the steps you, as a person, go through to get to the data. Your bot is going to use this same list of steps. As you go through each step, make sure you record what Wireshark is telling you. You are going to use cURL to mimic these requests later.

2: Time to get cURL to do the work for us

We now know much more about our target than we did before, and we’re ready to start a bit of coding.

cURL works with a simple 3 step process:

  1. We initialize a handle for our cURL session.
  2. We set a bunch of options on the request (e.g. URL we are connecting to).
  3. We execute the request and receive the results, either directly to the screen or piped into a variable.

Here’s a simple block of code which will go to www.google.com, put the results in a variable and then print that variable to screen:

$ch = curl_init(); // Initialize cURL handle
curl_setopt($ch, CURLOPT_URL, "http://www.google.com"); // Set the URL to fetch
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Return the page as a string instead of printing it straight out
$contents = curl_exec($ch); // Retrieve the Google home page into a variable
echo $contents; // Print the contents in the web-browser

Congrats: if you put this in a .php script and run it in your browser, you have just made a simple but functional bot. At this stage we haven’t closed the handle with curl_close(), so if we set new options with curl_setopt() and execute the handle again, we will get the next page we request. Enable cURL’s cookie handling (CURLOPT_COOKIEFILE / CURLOPT_COOKIEJAR) and it’s as if a user is at the browser clicking away in the same session.

cURL has a vast array of setopt options. It even supports HTTPS. I’d recommend browsing through these at: http://www.php.net/manual/en/function.curl-setopt.php

cURL does support HTTP authentication, but most sites use application-based auth. This means you usually put a username and password into a form on a web page, then click submit to send a POST request to the server. cURL can replicate all of this.

It is actually possible to find some details of what the form will send to the server by right-clicking the page containing the form in your browser and selecting “view page source”. You can record the names of the username and password fields and ensure you use these in your cURL POST request.

You may be asking at this stage: if you can view the fields sent to authenticate, why use Wireshark? Well, some websites like to put extra checks on the requests they receive to try to beef up security. For websites like these, you may want to include cURL options such as CURLOPT_USERAGENT and CURLOPT_REFERER. These can help convince the site that you’re not a nasty bot trying to do nasty things.

We now have to work through each step of our reconstructed navigation of the site. You can add each navigation stage, then run it with output to the screen, and hopefully you will eventually get through to the page that holds the data. A rough sketch of what these steps might look like follows below.
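
Here is a hedged sketch of those steps put together. The URLs, the form field names (“username”, “password”), the credentials and the user-agent string are all placeholders for whatever your Wireshark capture and “view page source” showed you; the cookie option is what lets cURL carry the session across requests.

$ch = curl_init();

// Options that apply to every request in the session.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return pages as strings rather than printing them
curl_setopt($ch, CURLOPT_COOKIEFILE, ""); // enable the in-memory cookie engine so the session is kept
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirect the site sends after logging in
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible; MyScraper/1.0)"); // placeholder user-agent

// Step 1: visit the home page so any initial session cookie gets set.
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
curl_exec($ch);

// Step 2: POST the login form, pretending we came from the home page.
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/login.php"); // placeholder login URL
curl_setopt($ch, CURLOPT_REFERER, "http://www.example.com/");
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    "username" => "myuser",     // placeholder field names and credentials
    "password" => "mypassword",
)));
curl_exec($ch);

// Step 3: request the page behind the login that holds the table.
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/data.php"); // placeholder data URL
curl_setopt($ch, CURLOPT_HTTPGET, true); // switch back to a plain GET
$datapage = curl_exec($ch);

echo $datapage; // eyeball the output to confirm you got past the login

Note that the handle is deliberately left open here, so later requests can reuse the same session.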

3: Sanitizing our table data 

OK, so now we have the HTML page that holds our table of data in a variable. How do we get at the data?

Well, unfortunately an HTML page never arrives as just the table; there will always be some kind of header and footer formatting around it.  At this stage we have to use our “view page source” tool again and find some kind of unique text string in the HTML that identifies the start and end of the table. Once we have this we can run a simple function to trim around the table.  In the words of any well-known TV chef, “here is one I prepared earlier”:

function trim_page($page, $headertxt, $footertxt)
{
    // Trim the header and the start-of-page information.
    $trimfind = $headertxt;
    $trimpos  = strpos($page, $trimfind);
    $trimmedpage = substr($page, $trimpos + strlen($trimfind), strlen($page) - $trimpos);

    // Trim the footer and the end-of-page information.
    $trimfind = $footertxt;
    $trimpos  = strpos($trimmedpage, $trimfind);
    $trimmedpage = substr($trimmedpage, 0, $trimpos);

    return $trimmedpage;
}

Hey presto, we have a table.
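
As a hedged example of calling it: $datapage here is the HTML page fetched earlier, and the marker strings are made up. Pick strings that appear exactly once in your own page source, immediately before and immediately after the table you want.

// Hypothetical marker strings -- use whatever uniquely brackets your table.
$headertxt = '<div id="results">'; // everything up to and including this is cut off
$footertxt = '<div id="footer">';  // everything from here onwards is cut off

$tablehtml = trim_page($datapage, $headertxt, $footertxt);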

4: Accessing the data in our table, as though it was XML

So, how do we turn our HTML table into XML? Hmmm, tricky. Well, not really tricky at all, as HTML is already very close to being XML. All we really need to do is make sure that the HTML table is well-formed, and then we can get at the data.

To do this we use HTMLTidy. Another simple function does the job:

function tidy_page($page)
{
    // Parse the raw HTML, repair anything malformed and return the cleaned-up markup.
    $tidy = tidy_parse_string($page);
    tidy_clean_repair($tidy);
    $tidiedpage = tidy_get_output($tidy);

    return $tidiedpage;
}

We now have a well-formed document containing our table data, which we can use with standard PHP XML manipulation tools such as DOM and SimpleXML.

For my purposes I loaded my XML into a DOM document like this:

$doc = new DOMDocument();
$doc->strictErrorChecking = FALSE;
$doc->loadHTML($html);

Then converted from DOM to SimpleXML like this:

$xml= simplexml_import_dom($doc);

I then used an XPath query to load the <tr> table rows into a SimpleXMLElement object:

$trowsxml = $xml->xpath("body/table/tr");

And just like magic I now have a SimpleXML object from which I can access each row and column of the table, just as though it was an array.
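
For instance, a minimal sketch (assuming each row is made up of plain <td> cells) of walking the rows and columns looks like this:

// Walk each <tr>, then each <td> inside it, just as if it were an array.
foreach ($trowsxml as $row) {
    $cells = array();
    foreach ($row->td as $cell) {
        $cells[] = trim((string) $cell); // cast the SimpleXML element to a plain string
    }
    echo implode(" | ", $cells) . "\n"; // do something useful with the row here
}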

Where you take the data from here is up to you.  Using cURL and Wireshark can be a powerful combination of tools. This is a very basic example of what can be done; my plug-in also included some logic to navigate multiple pages with a table on each, plus an additional function to aggregate the lot at the end (a rough sketch of that idea follows below).  Automating the retrieval of data from the web like this could potentially save you hours of manual scanning, screen by screen. If you are checking tabular data, whether that be sports scores or perhaps supplier price-lists, spending a little effort on creating some scripts like this could save you a lot of time.
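
For what it’s worth, here is a minimal sketch of that multi-page loop, assuming the cURL handle from the earlier login sketch is still open and that the site exposes a hypothetical page parameter on the data URL (both the URL and the fixed page count are made up):

// Hypothetical paginated URL and page count -- adjust to whatever your
// Wireshark capture and the site's own paging links tell you.
$allrows = array();

for ($page = 1; $page <= 5; $page++) {
    curl_setopt($ch, CURLOPT_URL, "http://www.example.com/data.php?page=" . $page);
    $html = curl_exec($ch);

    // Trim down to the table, tidy it up and parse it exactly as before.
    $table = tidy_page(trim_page($html, $headertxt, $footertxt));

    $doc = new DOMDocument();
    $doc->strictErrorChecking = FALSE;
    $doc->loadHTML($table);
    $xml = simplexml_import_dom($doc);

    // Aggregate the rows from every page into one array.
    foreach ($xml->xpath("body/table/tr") as $row) {
        $allrows[] = $row;
    }
}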

Disclaimer: Screen-scraping may be against the acceptable use policy of your target website. It could get you banned, blocked or even legal action could be taken.  If this is the case, you should not attempt to retrieve data in this way. This blog is for information only and should you end up in court, I will be there with you… Not sharing liability, just saying “I told you so”.
