PHP: Write a Web Page Scraper

I have not done many PHP projects, but I have always thought PHP is really good for web programming, even though I mostly program in Java and .NET.

I am not a good PHP programmer, but writing a simple web page scraper in PHP is very straightforward.

First, I defined a function to send the HTTP request (a POST here):

function http_post_content($url, $data) {
        // Encode the fields as an application/x-www-form-urlencoded body.
        $data = http_build_query($data);
        $aContext = array(
                'http' => array(
                        // Drop 'proxy' and 'request_fulluri' if you are not behind a proxy.
                        'proxy' => 'proxyserver:8080',
                        'request_fulluri' => true,
                        'method' => 'POST',
                        'header' =>
                                "Content-Type: application/x-www-form-urlencoded\r\n"
                                . "Content-Length: " . strlen($data) . "\r\n",
                        'content' => $data,
                ),
        );
        $cxContext = stream_context_create($aContext);
        // The second argument is use_include_path (false here); the context is the third.
        $content = file_get_contents($url, false, $cxContext);
        return $content;

        /*
        // Alternative: read through fopen() and throw on failure.
        $fp = @fopen($url, 'rb', false, $cxContext);
        if (!$fp) {
            throw new Exception("Problem with $url");
        }
        $content = @stream_get_contents($fp);
        if ($content === false) {
            throw new Exception("Problem reading data from $url");
        }
        */
}

In the http array I can define the proxy server settings, the method (POST or GET), and the request content.
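The same options array can describe a GET request as well. Here is a minimal sketch, assuming two hypothetical helpers that are not part of the original code: `build_http_context_options` builds the `http` options for either method, and `build_get_url` appends the fields to the URL, since a GET carries them in the query string rather than in the body. The example URL is a placeholder.

```php
<?php
// Hypothetical helper (not from the original post): builds the 'http'
// stream-context options for a GET or POST request.
function build_http_context_options($method, array $data = array(), $proxy = null) {
    $http = array('method' => strtoupper($method));

    if ($http['method'] === 'POST') {
        // POST sends the fields in the request body.
        $body = http_build_query($data);
        $http['header'] = "Content-Type: application/x-www-form-urlencoded\r\n"
            . "Content-Length: " . strlen($body) . "\r\n";
        $http['content'] = $body;
    }

    if ($proxy !== null) {
        // Same proxy settings as in http_post_content, e.g. 'proxyserver:8080'.
        $http['proxy'] = $proxy;
        $http['request_fulluri'] = true;
    }

    return array('http' => $http);
}

// Hypothetical helper: for GET, the fields go into the query string instead.
function build_get_url($url, array $data) {
    $separator = (strpos($url, '?') === false) ? '?' : '&';
    return $data ? $url . $separator . http_build_query($data) : $url;
}

// Build a GET context and URL; pass both to file_get_contents($url, false, $context).
$context = stream_context_create(build_http_context_options('get'));
$url = build_get_url('http://example.com/page', array('q' => 'x y'));
print($url . "\n");
```

Keeping the option building separate from the request itself makes it possible to inspect the array before anything is sent over the network.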

I use file_get_contents rather than cURL to retrieve the web page. cURL is more powerful, but file_get_contents is enough for a simple scraper.
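For comparison, here is a rough sketch of the same POST using the cURL extension, assuming it is installed. The function name `http_post_curl`, its optional `$proxy` parameter, and the 30-second timeout are my own choices for illustration, not something from the sample code.

```php
<?php
// Hypothetical cURL counterpart to http_post_content (assumes ext-curl is loaded).
function http_post_curl($url, array $data, $proxy = null) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    // A urlencoded string makes cURL send application/x-www-form-urlencoded.
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($data));
    // Return the response body from curl_exec() instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30); // assumed timeout, in seconds
    if ($proxy !== null) {
        curl_setopt($ch, CURLOPT_PROXY, $proxy); // e.g. 'proxyserver:8080'
    }

    $content = curl_exec($ch);
    if ($content === false) {
        // Unlike file_get_contents, cURL reports why the request failed.
        throw new Exception("Problem with $url: " . curl_error($ch));
    }
    return $content;
}
```

It would be called the same way as the function above, for example `http_post_curl($url, $data)`. The extra options (redirects, timeouts, error messages) are where cURL starts to pay off.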

To test it:


$url = "http://biz.thestar.com.my/marketwatch/main.asp";
$data = array(
        'bns' => '2',
        'clp' => '1',
        'klseViewDate' => '1/30/2008',
);
$content = http_post_content($url, $data);
$htmlDoc = new DOMDocument();
// The page may not be well-formed HTML; @ suppresses the parser warnings.
@$htmlDoc->loadHTML($content);
$xPath = new DOMXPath($htmlDoc);
// Path to the counter links, as the page was laid out when this was written.
$counters = $xPath->evaluate(
        '//table[@id="Table2"]/tr/td[2]/center/table/tr/td/span'
        . '[@class="text"]/table/tr/td/a'
);
for ($i = 0; $i < $counters->length; $i++) {
        print($counters->item($i)->nodeValue . "\n");
}

I pass in the URL and the POST data, then use an XPath query to retrieve the information that I want.
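To show how the XPath step works on its own, here is a small sketch that runs against a made-up HTML string instead of a live page. The table id, the link targets, and the names are invented for the example.

```php
<?php
// Made-up HTML standing in for a downloaded page.
$html = '<html><body>
<table id="prices">
  <tr><td><a href="/a">Alpha</a></td><td>1.20</td></tr>
  <tr><td><a href="/b">Beta</a></td><td>3.45</td></tr>
</table>
</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Reading the expression step by step:
//   //table[@id="prices"]  any table with id="prices", anywhere in the document
//   //tr                   every row somewhere below that table
//   /td[1]/a               the link inside the first cell of each row
$links = $xpath->query('//table[@id="prices"]//tr/td[1]/a');

$names = array();
foreach ($links as $link) {
    // nodeValue is the text inside the element; getAttribute reads an attribute.
    $names[] = trim($link->nodeValue) . ' -> ' . $link->getAttribute('href');
}
print(implode("\n", $names) . "\n");
```

Each `/` step selects direct children and `//` selects descendants at any depth, so an expression anchored on an id and a few `//` steps tends to survive small layout changes better than a full path from the top of the page.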

Note that the HTML may not be well-formed, so I suppress the parser warnings by prefixing the loadHTML call with @.
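An alternative to @ is to let libxml collect the problems instead of raising PHP warnings, which keeps them available for inspection. This is a sketch on a deliberately sloppy, made-up HTML string. Which messages libxml reports, if any, depends on the libxml2 version, so the code only prints whatever it gets.

```php
<?php
// Deliberately sloppy HTML: unclosed <p> tags and an unknown element.
$html = '<html><body><p>First<p>Second<madeup>x</madeup></body></html>';

// Collect parser problems internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();                  // empty the internal buffer
libxml_use_internal_errors($previous);  // restore the earlier setting

foreach ($errors as $error) {
    print(trim($error->message) . ' (line ' . $error->line . ")\n");
}

// The document is still usable even though the markup was not clean.
$xpath = new DOMXPath($doc);
$paragraphs = $xpath->query('//p');
print($paragraphs->length . " paragraphs found\n");
```

Unlike @, this only affects libxml messages, so unrelated errors on the same line are not hidden.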

One catch here is working out the correct XPath expression to use. You can find it with Solvent, a very good Firefox plug-in for web page scraping.


23 Comment(s)

  1. Ruben Zevallos Jr. | Mar 8, 2008 | Reply

    Very nice article… I did my own web scraping, but it has to navigate and get info from lots of web sites… I’m looking for some way to do it without too much effort…

    Thank you for your ideas…

  2. Caio Iglesias | Jun 18, 2008 | Reply

    Running your code exactly as it is doesn’t work on my server. Any ideas?

  3. admin | Jun 18, 2008 | Reply

    The website layout has changed, so the XPath expression has to be modified accordingly.

  4. stevo | Dec 29, 2008 | Reply

    I’m a bit confused on this. A big part of the article focuses on getting the web page, which IMO is simple (Even an fopen($url) should suffice).

    But I am a bit confused on how you are extracting/scraping the data. Can you please provide more information into Xpath, how you used it in PHP, and how it works in general? I tried wikipedia without too much luck (on xQuery and xPath)
    thanks

  5. glen | Dec 16, 2009 | Reply

    Hi
    I am looking for some way of pulling dynamic content off a web page and displaying it in a basic table format on a webpage. We have over 100 remote locations with a printer/NAS/scanner/WAP etc and each device has a web interface with a “status”. This status is dynamic(obviously) and I have a simple html page with 100 odd rows, and 10 columns displaying the status of each device at each location.
    I started using iFrames to display the status, which works, but is very time consuming to load the page as it is effectively loading 100×10 pages.
    Is there some (simple) method of pulling this dynamic part from each page to the ‘cell’ on my status page?

    cheers

    Glen

  6. asdasdas | Aug 5, 2010 | Reply

    asdasa

  7. someone | Aug 5, 2010 | Reply

    Thanks for the helpful input asdasdas, better than what qwerty said.
    cheers
    someone

  8. vignesh | Aug 11, 2010 | Reply

    I don’t know PHP. Can you explain PHP from the basics?

  11. Junggesellenabschied | Sep 24, 2010 | Reply

    Very nice article…

  13. Fun T-Shirt Sprüche | Sep 24, 2010 | Reply

    Great article. The code works fine for me.

    Thanks

  14. sağlık | Oct 21, 2010 | Reply

    There is not much to say about this page. I think it is very successful and useful.

  15. Fussbodenheizung | Nov 17, 2010 | Reply

    I was looking for PHP on the internet to get started and found your site accidentally. It helped me a lot, thanks.

  16. Praveen | Dec 16, 2010 | Reply

    Hi,

    Can you help me with how I can scrape the Latest News Slideshow data from the “http://www.indiatimes.com/” site onto our site with PHP?

    Regards,
    Praveen

  20. Florian | May 30, 2012 | Reply

    very helpful article. thanks a lot!

1 Trackback(s)

  1. From Electron Soup : Learning page scraping and mashups | Jul 23, 2008
