
PHP: Write a Web Page Scraper

Download Sample Code

I have not done many PHP projects, and I mostly program in Java and .NET, but I have always thought that PHP is really good for web programming.

I am not a good PHP programmer, but writing a simple web page scraper in PHP is very straightforward.

First, I define my HTTP GET/POST helper function:

function http_post_content($url, $data) {
        // Encode the POST fields as application/x-www-form-urlencoded
        $data = http_build_query($data);
        $aContext = array(
                'http' => array(
                        'proxy'           => 'tcp://proxyserver:8080', // only needed behind a proxy
                        'request_fulluri' => true,
                        'method'          => 'POST',
                        'header'          => "Content-type: application/x-www-form-urlencoded\r\n"
                                           . "Content-Length: " . strlen($data) . "\r\n",
                        'content'         => $data,
                ),
        );
        $cxContext = stream_context_create($aContext);
        $content   = file_get_contents($url, false, $cxContext);
        return $content;

        /*
        // Alternative: read the response through a stream handle instead.
        $fp = @fopen($url, 'rb', false, $cxContext);
        if (!$fp) {
                throw new Exception("Problem with $url, $php_errormsg");
        }
        $content = @stream_get_contents($fp);
        if ($content === false) {
                throw new Exception("Problem reading data from $url, $php_errormsg");
        }
        */
}

In the http array I can define the proxy server settings, the request method (POST or GET), the headers, and the request body (content).
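
For reference, a GET request using the same stream-context approach might look like the sketch below (http_get_content is just an illustrative name; the proxy entries are only needed if you sit behind a proxy):

function http_get_content($url) {
        $aContext = array(
                'http' => array(
                        'method' => 'GET',
                        // 'proxy' => 'tcp://proxyserver:8080',   // uncomment if behind a proxy
                        // 'request_fulluri' => true,
                        'header' => "Accept: text/html\r\n",
                ),
        );
        $cxContext = stream_context_create($aContext);
        return file_get_contents($url, false, $cxContext);
}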

I use file_get_contents to retrieve the web page instead of cURL; cURL is more powerful, but file_get_contents is enough here.
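
If you do need cURL, a roughly equivalent POST could look like this (a sketch assuming the cURL extension is enabled; http_post_content_curl is just an illustrative name, and the proxy line is optional):

function http_post_content_curl($url, $data) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($data));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);        // return the body instead of printing it
        // curl_setopt($ch, CURLOPT_PROXY, 'proxyserver:8080'); // uncomment if behind a proxy
        $content = curl_exec($ch);
        curl_close($ch);
        return $content;
}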

To test it,


$url="http://biz.thestar.com.my/marketwatch/main.asp";
$data=array('bns'=>'2',
                'clp'=>'1',
                'klseViewDate'=>'1/30/2008'
                );
$content = http_post_content($url,$data);
$htmlDoc = new DomDocument();
@$htmlDoc->loadHTML($content);
$xPath = new DOMXPath($htmlDoc);
$counters = $xPath->evaluate('//table[@id="Table2"]
/tr/td[2]/center/table/tr/td/span
[@class="text"]/table/tr/td/a');
for ($i = 0; $i < $counters->length; $i++){
        print($counters->item($i)->nodeValue.'
'); }

I pass in the URL and the POST data, then I use an XPath query to retrieve the information that I want.

Note that the HTML may not be well-formed, so I suppress the parser warnings by prefixing the loadHTML call with @.
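
Alternatively, instead of the @ operator you can let libxml collect the parser warnings quietly, for example:

libxml_use_internal_errors(true);   // keep parser warnings internal instead of emitting them
$htmlDoc = new DomDocument();
$htmlDoc->loadHTML($content);       // no @ needed now
libxml_clear_errors();              // discard the collected warnings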

One of the catches here is working out the correct XPath expression. You can always find it using Solvent, which is a very good Firefox plug-in for web page scraping.




5 Comment(s)

  1. Ruben Zevallos Jr. | Mar 8, 2008 | Reply

    Very nice article… I did my web scraping, but it has to navigate and get info from lots of web sites… I’m looking for some way to do it without too much effort…

    Thank you for your ideas…

  2. Caio Iglesias | Jun 18, 2008 | Reply

    Running your code exactly as it is doesn’t work on my server. Any ideas?

  3. admin | Jun 18, 2008 | Reply

    The website layout has changed. The XPath has to be modified accordingly.

  4. stevo | Dec 29, 2008 | Reply

    I’m a bit confused on this. A big part of the article focuses on getting the web page, which IMO is simple (Even an fopen($url) should suffice).

    But I am a bit confused about how you are extracting/scraping the data. Can you please provide more information on XPath, how you used it in PHP, and how it works in general? I tried Wikipedia without too much luck (on XQuery and XPath).
    Thanks

  5. glen | Dec 16, 2009 | Reply

    Hi
    I am looking for some way of pulling dynamic content off a web page and displaying it in a basic table format on a webpage. We have over 100 remote locations with a printer/NAS/scanner/WAP etc., and each device has a web interface with a “status”. This status is dynamic (obviously), and I have a simple HTML page with 100-odd rows and 10 columns displaying the status of each device at each location.
    I started using iFrames to display the status, which works, but it is very time-consuming to load the page as it is effectively loading 100×10 pages.
    Is there some (simple) method of pulling this dynamic part from each page into the ‘cell’ on my status page?

    cheers

    Glen

1 Trackback(s)

  1. From Electron Soup : Learning page scraping and mashups | Jul 23, 2008
