
PHP: Write a Web Page Scraper



I have not done many PHP projects, but I have always thought that PHP is well suited to web programming, even though I mostly program in Java and .NET.

I am not an expert PHP programmer, but writing a simple web page scraper in PHP is very straightforward.

First, I define my HTTP GET/POST function (shown here for POST):

function http_post_content($url, $data) {
	// Encode the form fields as key=value pairs joined by &.
	$data = http_build_query($data);
	$aContext = array(
		'http' => array(
			// Replace with your own proxy, or remove these two lines for a direct connection.
			'proxy' => 'proxyserver:8080',
			'request_fulluri' => true,
			'method' => 'POST',
			'header' =>
				"Content-Type: application/x-www-form-urlencoded\r\n"
				. "Content-Length: " . strlen($data) . "\r\n",
			'content' => $data,
		),
	);
	$cxContext = stream_context_create($aContext);
	// The second argument is use_include_path; false is its default value.
	$content = file_get_contents($url, false, $cxContext);
	return $content;

	/*
	Alternative using fopen and stream_get_contents.
	Note: $php_errormsg needs track_errors and was removed in PHP 8.

	$fp = @fopen($url, 'rb', false, $cxContext);
	if (!$fp) {
		throw new Exception("Problem with $url, $php_errormsg");
	}
	$content = @stream_get_contents($fp);
	if ($content === false) {
		throw new Exception("Problem reading data from $url, $php_errormsg");
	}
	*/
}

In the http array I can define the proxy server settings, the method (POST or GET), the request headers, and the content.
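To make the role of each key in the http array concrete, here is a minimal sketch of a helper that builds the options for either method. The name build_http_options and its parameters are my own for illustration and are not part of the original code. For a GET request the form fields would normally go into the URL's query string instead of the content key.

```php
<?php
// Hypothetical helper: build the options array for stream_context_create().
function build_http_options($method, array $data = array(), $proxy = null) {
    $http = array('method' => strtoupper($method));
    if ($proxy !== null) {
        // Only route through a proxy when one is given.
        $http['proxy'] = $proxy;
        $http['request_fulluri'] = true;
    }
    if ($http['method'] === 'POST') {
        // POST sends the encoded fields as the request body.
        $body = http_build_query($data);
        $http['header'] = "Content-Type: application/x-www-form-urlencoded\r\n"
            . "Content-Length: " . strlen($body) . "\r\n";
        $http['content'] = $body;
    }
    return array('http' => $http);
}

$options = build_http_options('post', array('bns' => '2', 'clp' => '1'));
$context = stream_context_create($options);
```

The resulting $context can be passed as the third argument of file_get_contents, exactly as in http_post_content.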

I use file_get_contents to retrieve the web page instead of cURL, although cURL is the more powerful option.

To test it:
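For comparison, a rough cURL version of the same POST might look like the sketch below. It assumes the curl extension is enabled; the function name http_post_content_curl, the optional $proxy parameter, and the 30 second timeout are my own choices for illustration.

```php
<?php
// Hypothetical cURL counterpart of http_post_content (requires ext-curl).
function http_post_content_curl($url, array $data, $proxy = null) {
    $ch = curl_init($url);
    $options = array(
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => http_build_query($data),
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_TIMEOUT => 30,           // assumed limit, in seconds
    );
    if ($proxy !== null) {
        $options[CURLOPT_PROXY] = $proxy;
    }
    curl_setopt_array($ch, $options);
    $content = curl_exec($ch);
    if ($content === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $content;
}
```

What cURL adds over the stream wrapper is mostly finer control, for example over redirects, timeouts, and cookies.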


$url = "http://biz.thestar.com.my/marketwatch/main.asp";
$data = array(
	'bns' => '2',
	'clp' => '1',
	'klseViewDate' => '1/30/2008',
);
$content = http_post_content($url, $data);
$htmlDoc = new DOMDocument();
@$htmlDoc->loadHTML($content);
$xPath = new DOMXPath($htmlDoc);
$counters = $xPath->evaluate('//table[@id="Table2"]/tr/td[2]/center/table/tr/td/span[@class="text"]/table/tr/td/a');
for ($i = 0; $i < $counters->length; $i++) {
	print($counters->item($i)->nodeValue . "\n");
}

I pass in the URL and the post data, then I use an XPath query to retrieve the information that I want.
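Because file_get_contents returns false when the request fails, it can help to check the result before handing it to the parser. The wrapper below is a minimal sketch; the name fetch_or_fail is hypothetical and not part of the original code.

```php
<?php
// Hypothetical helper: fetch a URL (or local path) and fail loudly on error.
function fetch_or_fail($url, $context = null) {
    // @ hides the warning; the failure is reported through the exception instead.
    $content = @file_get_contents($url, false, $context);
    if ($content === false) {
        $error = error_get_last();
        $reason = $error !== null ? $error['message'] : 'unknown error';
        throw new RuntimeException("Problem reading $url: $reason");
    }
    return $content;
}
```

A caller can wrap fetch_or_fail in try/catch and skip the parsing step when the download did not succeed.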

Note that the HTML may not be well-formed, so I suppress the parser warnings by prefixing the loadHTML call with @.
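An alternative to @ is libxml_use_internal_errors, which collects the parser's complaints instead of printing them, so they can still be inspected if needed. The sketch below uses a small made-up HTML fragment with missing closing tags as a stand-in for the real page.

```php
<?php
// Made-up, deliberately sloppy markup standing in for the downloaded page.
$html = '<html><body><table id="Table2"><tr>'
      . '<td><a href="#">AAA</a><td><a href="#">BBB</a>'
      . '</table><p>unclosed';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$errors = libxml_get_errors();   // array of LibXMLError objects, possibly empty
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$names = array();
foreach ($xpath->query('//table[@id="Table2"]//a') as $link) {
    $names[] = trim($link->nodeValue);
}
print(implode("\n", $names) . "\n");
```

Unlike @, this approach only silences the HTML parser and leaves other warnings visible.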

One catch here is finding the correct XPath expression to use. You can always find out using Solvent, which is a very good Firefox plug-in for web page scraping.
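Another way to explore a page's structure, without a browser plug-in, is to start from a loose query and print the full path of each match with DOMNode::getNodePath. This is only a sketch: the HTML below is invented for illustration, and on a real page you would load a saved copy of the response instead.

```php
<?php
// Invented markup standing in for a saved copy of the page.
$html = '<html><body><table id="Table2"><tr>'
      . '<td><a href="#">AAA</a></td><td><a href="#">BBB</a></td>'
      . '</tr></table></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Start with a loose query, then tighten it using the printed paths.
$paths = array();
foreach ($xpath->query('//table[@id="Table2"]//a') as $link) {
    $paths[] = $link->getNodePath();
    print($link->nodeValue . ' => ' . $link->getNodePath() . "\n");
}
```

A query anchored on an id and using // is usually less brittle than a long absolute path, although it can still break when the site's layout changes.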




3 Comment(s)

  1. Ruben Zevallos Jr. | Mar 8, 2008

    Very nice article… I did my own web scraping, but it has to navigate and get info from lots of web sites… I’m looking for some way to do it without too much effort…

    Thank you for your ideas…

  2. Caio Iglesias | Jun 18, 2008

    Running your code exactly as it is doesn’t work on my server. Any ideas?

  3. admin | Jun 18, 2008

    The website layout has changed. The XPath query has to be modified accordingly.

