PHP: Write a Web Page Scraper

I have not done many PHP projects, but I have always thought that PHP is really good for web programming, even though I mostly program in Java and .NET.

I am not a good PHP programmer, but writing a simple web page scraper in PHP is very straightforward.

First, I defined my HTTP GET/POST function:

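function http_post_content($url, $data) {
        // Encode the POST fields as application/x-www-form-urlencoded
        $data = http_build_query($data);
        $aContext = array(
                'http' => array(
                        'proxy' => 'tcp://proxyserver:8080',
                        'request_fulluri' => true,
                        'method' => 'POST',
                        'header' => "Content-Type: application/x-www-form-urlencoded\r\n"
                                . "Content-Length: " . strlen($data) . "\r\n",
                        'content' => $data
                )
        );
        $cxContext = stream_context_create($aContext);

        // @ hides the PHP warning; a failed request returns false
        $content = @file_get_contents($url, false, $cxContext);
        if ($content === false) {
                $err = error_get_last();
                throw new Exception("Problem with $url, " . ($err ? $err['message'] : 'unknown error'));
        }
        return $content;

        /* Alternative using fopen() and stream_get_contents():
        $fp = @fopen($url, 'rb', false, $cxContext);
        if (!$fp) {
                throw new Exception("Problem with $url");
        }
        $content = @stream_get_contents($fp);
        if ($content === false) {
                throw new Exception("Problem reading data from $url");
        }
        */
}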

In the http array I can define the proxy server settings, the method (which can be POST or GET), the headers, and the content.
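To illustrate how the http options differ between GET and POST, here is a minimal sketch. The helper name build_http_options and its parameters are my own invention for this example, not part of PHP; only the option keys (method, header, content, proxy, request_fulluri) are PHP's HTTP context options.

```php
<?php
// Hypothetical helper (not a PHP built-in): builds the 'http' context
// options for either a GET or a POST request.
function build_http_options($method, array $data = array(), $proxy = null) {
    $http = array('method' => strtoupper($method));

    if ($http['method'] === 'POST') {
        // POST sends the URL-encoded fields in the request body
        $body = http_build_query($data);
        $http['header'] = "Content-Type: application/x-www-form-urlencoded\r\n"
            . "Content-Length: " . strlen($body) . "\r\n";
        $http['content'] = $body;
    }

    if ($proxy !== null) {
        // Assumed proxy address format: tcp://host:port
        $http['proxy'] = $proxy;
        $http['request_fulluri'] = true;
    }

    return array('http' => $http);
}

$options = build_http_options('post', array('q' => 'php scraper'), 'tcp://proxyserver:8080');
$context = stream_context_create($options);
```

For a GET request there is no body, so in this sketch any parameters would have to be appended to the URL as a query string instead.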

I use file_get_contents rather than cURL to retrieve the web page. cURL is more powerful, but file_get_contents is enough for a simple scraper.
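For comparison, the same POST request could be sketched with cURL roughly as follows. This assumes the cURL extension is enabled; the function name http_post_content_curl is hypothetical, and I have not tuned any of the many other options cURL offers.

```php
<?php
// Sketch of a POST request using the cURL extension.
// The function name is made up for this example.
function http_post_content_curl($url, array $data, $proxy = null) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($data));
    // Return the response body as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if ($proxy !== null) {
        // Assumed proxy format: host:port, e.g. 'proxyserver:8080'
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
    }

    $content = curl_exec($ch);
    if ($content === false) {
        throw new Exception("Problem with $url, " . curl_error($ch));
    }
    return $content;
}
```

It would be called the same way as http_post_content, with the proxy passed as an optional third argument.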

To test it:

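$content = http_post_content($url,$data);
$htmlDoc = new DomDocument();
// @ suppresses the warnings raised for HTML that is not well-formed
@$htmlDoc->loadHTML($content);
$xPath = new DOMXPath($htmlDoc);
$counters = $xPath->evaluate('//table[@id="Table2"]');
for ($i = 0; $i < $counters->length; $i++){
        // e.g. print the text content of each matching node
        echo $counters->item($i)->nodeValue . "\n";
}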

I pass in the URL and the POST data, then I use XPath to retrieve the information that I want.
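To show the XPath step on its own, without any network access, here is a small sketch that runs against an inline HTML string. The sample HTML and the values in it are made up for illustration; only the table id "Table2" comes from my code.

```php
<?php
// Sample HTML standing in for a downloaded page (made up for this example)
$html = '<html><body>'
      . '<table id="Table1"><tr><td>ignored</td></tr></table>'
      . '<table id="Table2"><tr><td>Visits</td><td>1024</td></tr>'
      . '<tr><td>Downloads</td><td>77</td></tr></table>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select every row of the table whose id is "Table2"
$rows = $xpath->query('//table[@id="Table2"]//tr');

$result = array();
foreach ($rows as $row) {
    // A relative expression, evaluated with the row as the context node
    $cells = $xpath->query('td', $row);
    $result[$cells->item(0)->nodeValue] = $cells->item(1)->nodeValue;
}

print_r($result);
```

Using //tr rather than /tr keeps the expression working whether or not the rows are wrapped in a tbody element, which may differ between what a browser shows and what the PHP parser builds.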

Note that the HTML may not be well-formed, so I suppress the parser warnings by prefixing the loadHTML call with @.
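An alternative to the @ prefix, which I did not use here, is libxml_use_internal_errors. It collects the parser messages so they can be inspected instead of being silently thrown away. A minimal sketch, with a made-up malformed HTML string:

```php
<?php
// Deliberately malformed HTML: an unclosed <b> tag and an unknown <foo> tag
$html = '<html><body><table id="Table2"><tr><td><b>42</td></tr></table><foo>bar</body></html>';

// Collect parser errors internally instead of emitting PHP warnings
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();                  // empty the internal error buffer
libxml_use_internal_errors($previous);  // restore the earlier setting

$xpath = new DOMXPath($doc);
$cell = $xpath->query('//table[@id="Table2"]//td')->item(0);

echo trim($cell->nodeValue), "\n";
echo count($errors), " parser message(s) collected\n";
```

How many messages are collected depends on the libxml2 version installed, so I would not rely on a particular count.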

One catch here is working out the correct XPath expression to use. You can always find it using Solvent, which is a very good Firefox plug-in for web page scraping.

