Audience: intermediate coders with a knowledge of PHP

PHP provides a lot of useful tools for simple webpage scraping, but its limitations start to show when it is put up against a search form or website that uses hidden server-generated fields to validate user input.


Such was the case when I wrote a scraper to open up Hackney Council’s planning applications and appeals system. Users wanting a listing of all planning applications submitted in the last month have to select that option and submit a form, which then takes them to a results page. Unlike most search engines, which strive for accessibility by putting the query parameters in the URL, Hackney Council’s Planning Explorer generates a temporary XML file containing all the results, which is then paginated and styled via a number of other files on the results listing page. This makes it impossible to link directly to a set of search results. Further, hidden fields with server-generated values are placed inside the search form, and the query will not complete without them.

Luckily, a helpful class called SimpleBrowser (included with the open source SimpleTest unit testing suite) makes programmatically browsing through forms like this fairly simple. Here’s how to use it with ScraperWiki:

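if (!file_exists('simpletest/browser.php')) { // first need to download SimpleTest
    $data = file_get_contents(""); // URL of the SimpleTest tarball
    file_put_contents("simpletest.tar.gz", $data);
    exec('tar -xzvf simpletest.tar.gz'); // unpacks into the simpletest/ directory
}
require_once 'simpletest/browser.php';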

If you’re not using ScraperWiki, just extract SimpleTest somewhere meaningful and include browser.php.

After that, open a new browser window by instantiating the class and browsing to the address you want. If you’re dealing with a site that won’t load unless you have cookies enabled (itself an annoying problem made trivial through SimpleBrowser), call SimpleBrowser’s useCookies() method.

$browser = new SimpleBrowser();
$browser->get($url); // $url is a placeholder for the address of the search form

The next step is to set the form options while retrieving any hidden form fields. Two methods are really useful for this: SimpleBrowser::getField() and SimpleBrowser::setField(). In the code below, I use getField() to get the values of the two hidden fields (“__VIEWSTATE” and “__EVENTVALIDATION”) and then use setField() to set the values of the search options:

$viewstate = $browser->getField('__VIEWSTATE');
$eventValidation = $browser->getField('__EVENTVALIDATION');
$browser->setField('__VIEWSTATE', $viewstate);
$browser->setField('__EVENTVALIDATION', $eventValidation);
$browser->setField('cboSelectDateValue', 'DATE_RECEIVED');
$browser->setField('cboMonths', '1');
$browser->setField('rbGroup', 'rbMonth');
$browser->setField('cboDays', '7');

Note that I didn’t really need to call setField() on __VIEWSTATE and __EVENTVALIDATION: the server already set their values when SimpleBrowser loaded the page, so they are automatically included with all the other fields when the SimpleBrowser::clickSubmitByName() method is called (which does what it says it does: clicks the submit button). Note also that the submit button is specified by name; this can be useful if there are multiple submit buttons, e.g., a “submit” and a “reset” button.
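To make the role of the hidden fields more concrete, here is a minimal sketch of what any browser has to do before submitting such a form: collect the hidden inputs the server placed on the page and send them back alongside the chosen options. It uses PHP’s built-in DOM extension rather than SimpleBrowser, and it is not SimpleBrowser’s actual implementation. The markup, the field values and the hiddenFields() helper are all made up for illustration.

```php
<?php
// Hypothetical markup standing in for the search form; the values are invented.
$html = <<<HTML
<form method="post" action="search.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789">
  <select name="cboMonths"><option value="1">1</option></select>
  <input type="submit" name="search" value="Search">
</form>
HTML;

// Illustrative helper (not part of SimpleTest): returns name => value
// for every hidden input found in the given HTML.
function hiddenFields(string $html): array
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $fields = [];
    foreach ($xpath->query('//input[@type="hidden"]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Start from the server-generated values, then add our own search option.
$fields = hiddenFields($html);
$fields['cboMonths'] = '1';

// This string is what would travel in the body of the POST request.
$body = http_build_query($fields);
echo $body, "\n";
```

If the hidden values were left out of that request body, the server would have nothing to validate the submission against, which is why the query fails without them.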

Now that you’re on the next page, you can scrape the rest of it pretty easily. To simply dump the current page into a string variable (which can then be manipulated via PHP’s DOM library or any number of other ways), just use SimpleBrowser::getContent(). Alternatively, the method SimpleBrowser::clickLink() lets you travel to another page by specifying link text and an index. The following clicks through a series of links and adds the resulting page content to an array:

$urls = $browser->getUrls(); // SimpleBrowser::getUrls() returns an array of all URLs on the page
for ($i = 0; $i < count($urls); $i++) {
    if ($browser->clickLink('More info...', $i) === false) {
        break; // no link with this text at this index, so we are done
    }
    $content[] = $browser->getContent();
    $browser->back(); // SimpleBrowser::back() takes us back to the results page
}

Notice how I used two more SimpleBrowser methods here? SimpleBrowser::getUrls() returns an array of every URL on the page, which is really useful for results pages such as this. Its length gives the script an upper bound on how many links there are to follow, so I run a loop that clicks through them, using getContent() to save the raw HTML of each page to the $content array. After that, I use SimpleBrowser::back() to click the browser’s “back” button and return to the search results page.
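Once the raw HTML is sitting in $content, PHP’s DOM library can turn each page into structured data. The following is only a sketch under assumptions: it supposes that a detail page lays out its data as table rows with a th label and a td value, which may not match the real pages. The sample markup, its values and the tableToRecord() helper are invented for illustration.

```php
<?php
// Stand-in for one entry of $content; a real detail page will look different.
$page = '<html><body><table>'
      . '<tr><th>Application Number</th><td>2011/0001</td></tr>'
      . '<tr><th>Date Received</th><td>01/02/2011</td></tr>'
      . '</table></body></html>';

// Illustrative helper: maps each "label | value" table row to an array entry.
function tableToRecord(string $html): array
{
    libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $record = [];
    // Only consider rows that contain both a header cell and a data cell.
    foreach ($xpath->query('//tr[th and td]') as $row) {
        $label = trim($xpath->query('th', $row)->item(0)->textContent);
        $value = trim($xpath->query('td', $row)->item(0)->textContent);
        $record[$label] = $value;
    }
    return $record;
}

print_r(tableToRecord($page));
```

In a scraper, the same function could be applied to every page in turn, for example with array_map('tableToRecord', $content), after adjusting the XPath expressions to the real markup.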

In closing, SimpleBrowser is a great lightweight library for emulating a web browser within PHP scripts. For more information about all of its methods, check out its documentation.