
PHP: Write a Web Page Scraper


I have not done many PHP projects, but I have always thought PHP is well suited to web programming, even though I mostly program in Java and .NET.

I am not an expert PHP programmer, but writing a simple web page scraper in PHP is very straightforward.

First, I define my HTTP GET/POST function:

function http_post_content($url, $data) {
        // Encode the form fields as application/x-www-form-urlencoded
        $data = http_build_query($data);
        $aContext = array(
                'http' => array(
                        'method' => 'POST',
                        'proxy' => 'tcp://proxyserver:8080',
                        'request_fulluri' => true,
                        'header' => "Content-Type: application/x-www-form-urlencoded\r\n"
                                . "Content-Length: " . strlen($data) . "\r\n",
                        'content' => $data
                )
        );
        $cxContext = stream_context_create($aContext);

        // @ hides the PHP warning; failure is reported as an exception instead
        $content = @file_get_contents($url, false, $cxContext);
        if ($content === false) {
                $error = error_get_last();
                throw new Exception("Problem reading data from $url: "
                        . ($error ? $error['message'] : 'unknown error'));
        }
        return $content;
}

In the http array I can define the proxy server settings, the method (POST or GET), the request headers, and the content to send.
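As a rough sketch of how those options fit together, the helper below builds the array that stream_context_create() expects. The function name build_http_options, the sample field, and the proxy host are all made up for illustration; they are not part of the original code.

```php
<?php
// Hypothetical helper (not from the original post): builds the options
// array that stream_context_create() expects for an HTTP request.
function build_http_options($method, array $data = array(), $proxy = null) {
    $http = array('method' => strtoupper($method));

    if ($http['method'] === 'POST') {
        // POST sends the encoded form fields in the request body.
        $body = http_build_query($data);
        $http['header'] = "Content-Type: application/x-www-form-urlencoded\r\n"
            . "Content-Length: " . strlen($body) . "\r\n";
        $http['content'] = $body;
    }

    if ($proxy !== null) {
        // Proxy URI such as 'tcp://proxyserver:8080' (placeholder host).
        $http['proxy'] = $proxy;
        $http['request_fulluri'] = true;
    }

    return array('http' => $http);
}

$options = build_http_options('post', array('q' => 'php scraper'), 'tcp://proxyserver:8080');
$context = stream_context_create($options);
print_r($options);
```

For a GET request the fields would normally go into the URL's query string instead, so this sketch adds no body or content headers in that case.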

I use file_get_contents to retrieve the web page rather than cURL. cURL is more powerful, but file_get_contents is enough for a simple scraper.

To test it,

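$content = http_post_content($url, $data);
$htmlDoc = new DOMDocument();
// @ suppresses the warnings libxml raises for malformed HTML
@$htmlDoc->loadHTML($content);
$xPath = new DOMXPath($htmlDoc);
// Extend this path (for example down to the cells) to reach the data you need
$counters = $xPath->evaluate('//table[@id="Table2"]');
for ($i = 0; $i < $counters->length; $i++) {
        // Illustrative loop body: print the text of each matched node
        echo $counters->item($i)->nodeValue . "\n";
}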

I pass in the URL and the POST data, load the returned HTML into a DOMDocument, and then use an XPath query to retrieve the information that I want.
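To make the XPath step more concrete, here is a minimal sketch that runs against an inline HTML string instead of a live page. The sample markup and its values are invented; only the table id "Table2" comes from the code above.

```php
<?php
// Made-up sample page; only the id "Table2" is taken from the article.
$html = '<html><body>'
      . '<table id="Table1"><tr><td>ignored</td></tr></table>'
      . '<table id="Table2">'
      . '<tr><td>Visitors</td><td>120</td></tr>'
      . '<tr><td>Downloads</td><td>45</td></tr>'
      . '</table></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select every row inside the table whose id attribute is "Table2".
$rows = $xpath->query('//table[@id="Table2"]//tr');

$result = array();
foreach ($rows as $row) {
    // A relative query: the td children of the current row only.
    $cells = $xpath->query('./td', $row);
    $label = trim($cells->item(0)->nodeValue);
    $result[$label] = trim($cells->item(1)->nodeValue);
}
print_r($result);
```

The first query picks the rows of one specific table, and the second, relative query walks the cells of each row, which is usually all a simple scraper needs.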

Note that the HTML may not be well-formed, so I suppress the parser warnings by prefixing the loadHTML call with @.
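An alternative to @, sketched here under the assumption that you would rather inspect the parser messages than discard them, is libxml's internal error buffer. The malformed sample markup is invented for this example.

```php
<?php
// Deliberately sloppy markup (made up for this sketch): stray closing tag.
$html = '<html><body><table id="Table2"><tr><td>42</td></tr></table></bogus></body></html>';

// Collect parser messages in a buffer instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$errors = libxml_get_errors();   // array of LibXMLError objects, possibly empty
libxml_clear_errors();
libxml_use_internal_errors($previous);  // restore the earlier setting

foreach ($errors as $error) {
    echo 'line ' . $error->line . ': ' . trim($error->message) . "\n";
}

// The document is still usable even though the markup was not clean.
$xpath = new DOMXPath($doc);
$value = $xpath->query('//table[@id="Table2"]//td')->item(0)->nodeValue;
echo $value . "\n";
```

Which messages are reported depends on the libxml version installed, so the loop may print nothing on some systems.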

The one catch here is working out the correct XPath expression to use. You can find it with Solvent, which is a very good Firefox plug-in for web page scraping.
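If a browser plug-in is not at hand, a different technique is to ask PHP itself where an element sits. The sketch below uses DOMNode::getNodePath() to print a candidate XPath for every table in a document; the sample markup is made up, and this is not how Solvent works.

```php
<?php
// Made-up sample page used only to demonstrate path discovery.
$html = '<html><body><div><table id="Table2"><tr><td>42</td></tr></table></div></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);

// Map each table's id attribute to the XPath of that table in the document.
$paths = array();
foreach ($doc->getElementsByTagName('table') as $table) {
    $id = $table->getAttribute('id');
    $paths[$id] = $table->getNodePath();
    echo $id . ' => ' . $paths[$id] . "\n";
}

// Feed a discovered path back into DOMXPath to confirm it selects the table.
$xpath = new DOMXPath($doc);
$matches = $xpath->query($paths['Table2']);
echo $matches->length . " node(s) matched\n";
```

Paths found this way are positional, so they break as soon as the page layout changes; an expression keyed on an id attribute tends to survive longer.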


4 Comment(s)

  1. Ruben Zevallos Jr. | Mar 8, 2008

    Very nice article… I did my own web scraping, but it has to navigate and get info from lots of web sites… I’m looking for some way to do it without too much effort…

    Thank you for your ideas…

  2. Caio Iglesias | Jun 18, 2008

    Running your code exactly as it is doesn’t work on my server. Any ideas?

  3. admin | Jun 18, 2008

    The website layout has changed. The XPath query has to be modified accordingly.

  4. stevo | Dec 29, 2008

    I’m a bit confused by this. A big part of the article focuses on getting the web page, which IMO is simple (even an fopen($url) should suffice).

    But I am a bit confused about how you are extracting/scraping the data. Can you please provide more information on XPath, how you used it in PHP, and how it works in general? I tried Wikipedia without much luck (on XQuery and XPath).

1 Trackback(s)

  1. From Electron Soup : Learning page scraping and mashups | Jul 23, 2008
