PHP: Write a Web Page Scraper
By admin on Feb 4, 2008 in php, Programming
I have not done many PHP projects, but I have always thought PHP is really good for web programming, even though I mostly program in Java and .NET.
I am not a good PHP programmer, but writing a simple web page scraper in PHP is very straightforward.
First, I defined my HTTP GET/POST function:
function http_post_content($url, $data) {
    // Encode the form fields as a URL-encoded query string.
    $data = http_build_query($data);
    $aContext = array(
        'http' => array(
            'proxy' => 'proxyserver:8080', // placeholder proxy address
            'request_fulluri' => true,
            'method' => 'POST',
            'header' => "Content-type: application/x-www-form-urlencoded\r\n"
                      . "Content-Length: " . strlen($data) . "\r\n",
            'content' => $data,
        ),
    );
    $cxContext = stream_context_create($aContext);
    // The second argument is use_include_path; the old FILE_TEXT flag had no effect.
    $content = file_get_contents($url, false, $cxContext);
    return $content;
    /* Alternative using fopen:
    $fp = @fopen($url, 'rb', false, $cxContext);
    if (!$fp) {
        throw new Exception("Problem with $url, $php_errormsg");
    }
    $content = @stream_get_contents($fp);
    if ($content === false) {
        throw new Exception("Problem reading data from $url, $php_errormsg");
    }
    */
}
In the http array I can define the proxy server settings, the method (POST or GET), and the content.
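As a rough sketch of how the http array differs between GET and POST, the options could be built by a small helper. The name build_http_options and its parameters are made up for illustration and are not part of the function above; for a GET request the data would normally go into the URL's query string instead, which this sketch leaves to the caller.

```php
<?php
// Hypothetical helper (not in the original post): builds the options array
// that stream_context_create() expects for either a GET or a POST request.
function build_http_options($method, array $data = array(), $proxy = null) {
    $http = array('method' => strtoupper($method));
    if ($http['method'] === 'POST') {
        // Only POST carries a body, so only POST needs these headers.
        $body = http_build_query($data);
        $http['header'] = "Content-type: application/x-www-form-urlencoded\r\n"
                        . "Content-Length: " . strlen($body) . "\r\n";
        $http['content'] = $body;
    }
    if ($proxy !== null) {
        // Proxy settings are optional; leave them out for a direct connection.
        $http['proxy'] = $proxy;
        $http['request_fulluri'] = true;
    }
    return array('http' => $http);
}

$options = build_http_options('post', array('bns' => '2', 'clp' => '1'));
$context = stream_context_create($options);
```

The resulting $context can be passed as the third argument to file_get_contents, the same way $cxContext is used above.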
I use file_get_contents rather than cURL to retrieve the web page. cURL is more powerful, but file_get_contents is enough for this job.
To test it:
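For comparison, the same POST might look roughly like this with cURL. This is only a sketch, assuming the cURL extension is installed; the function name curl_post_content, the 30-second timeout and the optional $proxy parameter are illustrative choices, not something from the code above.

```php
<?php
// Sketch of the same POST using the cURL extension (assumed to be installed).
// curl_post_content is a hypothetical name, not from the original post.
function curl_post_content($url, array $data, $proxy = null) {
    $ch = curl_init($url);
    $options = array(
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => http_build_query($data),
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_TIMEOUT => 30,          // give up after 30 seconds (arbitrary)
    );
    if ($proxy !== null) {
        $options[CURLOPT_PROXY] = $proxy;
    }
    curl_setopt_array($ch, $options);
    $content = curl_exec($ch);
    if ($content === false) {
        throw new RuntimeException("Problem with $url: " . curl_error($ch));
    }
    return $content;
}
```

Unlike file_get_contents, this version reports a transport failure as an exception carrying cURL's own error message.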
$url = "http://biz.thestar.com.my/marketwatch/main.asp";
$data = array(
    'bns' => '2',
    'clp' => '1',
    'klseViewDate' => '1/30/2008',
);
$content = http_post_content($url, $data);

$htmlDoc = new DOMDocument();
@$htmlDoc->loadHTML($content);
$xPath = new DOMXPath($htmlDoc);
$counters = $xPath->evaluate('//table[@id="Table2"]/tr/td[2]/center/table/tr/td/span[@class="text"]/table/tr/td/a');
for ($i = 0; $i < $counters->length; $i++) {
    print($counters->item($i)->nodeValue . "\n");
}
I pass in the URL and the POST data, then I use XPath to retrieve the information that I want.
Note that the HTML may not be well-formed, so I suppress the parser warnings by prefixing the loadHTML call with @.
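To see the XPath step in isolation, here is a minimal sketch that runs without any network access. The HTML fragment is made up and only stands in for a downloaded page, and it uses libxml_use_internal_errors as an alternative to @ for keeping parser warnings quiet.

```php
<?php
// A small, made-up HTML fragment standing in for a downloaded page.
// It is deliberately sloppy (the <p> is never closed).
$html = '<html><body>'
      . '<table id="Table2"><tr>'
      . '<td><a href="/a">Alpha</a></td>'
      . '<td><a href="/b">Beta</a><br></td>'
      . '</tr></table>'
      . '<p>unclosed paragraph'
      . '</body></html>';

// Collect parser warnings internally instead of silencing them with @.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
// Select every <a> anywhere inside the table whose id is "Table2".
$links = $xpath->query('//table[@id="Table2"]//a');

$names = array();
foreach ($links as $link) {
    $names[] = trim($link->nodeValue);
}
print(implode("\n", $names) . "\n");
```

The expression here uses // to match descendants at any depth, which tends to be less brittle than spelling out every tr/td step when the page layout shifts.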
One catch here is finding the correct XPath expression to use. You can always work it out using Solvent, which is a very good Firefox plug-in for web page scraping.
Ruben Zevallos Jr. | Mar 8, 2008 | Reply
Very nice article… I did my own web scraping, but it has to navigate and get info from lots of web sites… I’m looking for some way to do it without too much effort…
Thank you for your ideas…
Caio Iglesias | Jun 18, 2008 | Reply
Running your code exactly as it is doesn’t work on my server. Any ideas?
admin | Jun 18, 2008 | Reply
The website layout has changed, so the XPath expression has to be modified accordingly.
stevo | Dec 29, 2008 | Reply
I’m a bit confused about this. A big part of the article focuses on getting the web page, which IMO is simple (even an fopen($url) should suffice).
But I am a bit confused about how you are extracting/scraping the data. Can you please provide more information on XPath, how you used it in PHP, and how it works in general? I tried Wikipedia without too much luck (on XQuery and XPath).
thanks
glen | Dec 16, 2009 | Reply
Hi
I am looking for some way of pulling dynamic content off a web page and displaying it in a basic table format on a webpage. We have over 100 remote locations with a printer/NAS/scanner/WAP etc. and each device has a web interface with a “status”. This status is dynamic (obviously), and I have a simple HTML page with 100-odd rows and 10 columns displaying the status of each device at each location.
I started using iFrames to display the status, which works, but the page is very slow to load, as it is effectively loading 100×10 pages.
Is there some (simple) method of pulling this dynamic part from each page into the ‘cell’ on my status page?
cheers
Glen
vignesh | Aug 11, 2010 | Reply
I don’t know PHP. Can you explain PHP from the basics?
Praveen | Dec 16, 2010 | Reply
Hi,
Can you help me with how I can scrape the “http://www.indiatimes.com/” site’s Latest News Slideshow data onto our site with PHP?
Regards,
Praveen
Florian | May 30, 2012 | Reply
very helpful article. thanks a lot!