I have been using the file_get_html() function of the Simple HTML DOM parser to parse and extract the contents of a remote website. The script was working well a few months ago, but not anymore (file_get_html() returns null). After doing some research, I understood that file_get_html() may not work with all remote websites, but cURL should.
Before I explain why file_get_html() does not work, have a look at my PHP script that uses file_get_html() to download the HTML of a remote website.
require_once('simple_html_dom.php');

// Create DOM from URL or file
$html = file_get_html('https://www.remotesite.com/61224.html');
echo $html;
The output of the above script was NULL, which means file_get_html() didn't retrieve any HTML content from remotesite.com. If you then call the find() function of the Simple HTML DOM parser on that result, you will likely end up with the error below:
PHP Fatal error: Call to a member function find() on a non-object
Because find() has no HTML object to work with, the call fails with the above error message.
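Until the download itself is fixed, the fatal error can at least be avoided by checking the result before calling find(). The snippet below is a minimal sketch of that idea; safe_find() is a hypothetical helper name of my own, not part of Simple HTML DOM.

```php
<?php
/**
 * Hypothetical helper (not part of Simple HTML DOM): call find() only when
 * $html is a usable object. Returns an empty array when the download
 * failed and $html is null or false.
 */
function safe_find($html, string $selector): array
{
    // A failed file_get_html() call gives us a non-object, so bail out early.
    if (!is_object($html) || !method_exists($html, 'find')) {
        return [];
    }

    $result = $html->find($selector);

    // Without an index argument, find() is expected to return an array of elements.
    return is_array($result) ? $result : [];
}
```

With this in place, something like safe_find(file_get_html($url), 'a') yields an empty list instead of crashing when the page could not be fetched.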
So how do you fix this issue? Use cURL.
How to use cURL to retrieve the HTML of a remote website?
require_once('simple_html_dom.php');

$base = 'https://www.remotesite.com/61224.html';

$curl = curl_init();
// Note: this switches off SSL certificate verification
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $base);
curl_setopt($curl, CURLOPT_REFERER, $base);
// Return the response as a string instead of printing it
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$str = curl_exec($curl);
curl_close($curl);

// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($str);
Source: Stack Overflow
In the above program, cURL retrieves the HTML of the remote website and the output is stored in $str. Next, a Simple HTML DOM object is created and $str is loaded into it. In other words, the functionality of file_get_html() has been replaced with a set of cURL functions.
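The cURL calls can also be wrapped in a small function that reports failure instead of passing a bad value on to the parser. This is only a sketch under my own assumptions: the function name fetch_html_with_curl(), the 10 second timeout, and treating any HTTP status of 400 or above as a failure are all illustrative choices.

```php
<?php
/**
 * Hypothetical helper: fetch a URL with cURL and return the response body,
 * or null when the request failed.
 */
function fetch_html_with_curl(string $url, int $timeoutSeconds = 10): ?string
{
    $curl = curl_init();
    curl_setopt_array($curl, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,            // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // assumed limit for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // assumed limit for the whole request
    ]);

    $body   = curl_exec($curl);
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    // The handle is released when $curl goes out of scope.

    // curl_exec() returns false on transport errors (DNS failure, refused
    // connection, timeout); HTTP error pages are treated as failures too.
    if ($body === false || $status >= 400) {
        return null;
    }

    return $body;
}
```

The caller can then test the return value for null before creating the simple_html_dom object and calling load().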
So why does cURL work when file_get_html() does not?
The difference is that file_get_html() relies on the 'allow_url_fopen' setting in PHP. Internally it calls file_get_contents(), which can open URLs only when that setting is on. For security reasons, many web hosting providers disable 'allow_url_fopen' in php.ini, which causes file_get_html() to fail. cURL, on the other hand, does not use 'allow_url_fopen'.
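To see which of the two approaches a given server supports, the setting and the extension can be checked from a script. The following is a small diagnostic sketch; the function name remote_fetch_capabilities() is made up for this example.

```php
<?php
/**
 * Hypothetical diagnostic: report which ways of fetching remote URLs
 * this PHP installation allows.
 */
function remote_fetch_capabilities(): array
{
    return [
        // ini_get() returns the value as a string such as "1", "0", "On" or "",
        // so it is normalized to a boolean here.
        'allow_url_fopen' => filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN),
        // curl_init() exists only when the cURL extension is loaded.
        'curl'            => function_exists('curl_init'),
    ];
}

$caps = remote_fetch_capabilities();
echo 'allow_url_fopen: ', $caps['allow_url_fopen'] ? 'on' : 'off', PHP_EOL;
echo 'cURL extension:  ', $caps['curl'] ? 'available' : 'missing', PHP_EOL;
```

If the first line reports "off" and the second "available", that matches the situation described above: file_get_html() fails on URLs while cURL keeps working.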
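Why did file_get_html() work before and not now?
Probably the hosting provider of the server running the script has since decided that leaving 'allow_url_fopen' set to 'On' is a security risk and switched it off. Note that this is a setting on the server that runs your script, not on the remote website.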
What is the difference between file_get_html() and cURL?
You already know one difference, but cURL has many other advantages. As one Stack Overflow user says, cURL's many setopt options allow you to fine-tune the request. Have a look at the various options available through curl_setopt().
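As an illustration of that fine-tuning, the options can be collected in an array and applied in one go with curl_setopt_array(). This is a sketch only: build_curl_options() is a hypothetical helper, and the user agent string, header, and limits are example values rather than recommendations.

```php
<?php
/**
 * Hypothetical helper: build a set of cURL options with some defaults
 * that callers can override per request.
 */
function build_curl_options(string $url, array $overrides = []): array
{
    $defaults = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 5,    // stop after five redirects
        CURLOPT_TIMEOUT        => 15,   // give up after 15 seconds
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleBot/1.0)',
        CURLOPT_HTTPHEADER     => ['Accept-Language: en-US,en;q=0.8'],
    ];

    // array_replace() keeps the integer option keys and lets overrides win.
    return array_replace($defaults, $overrides);
}

// Example: same defaults, but with a shorter timeout for this request.
$options = build_curl_options('https://www.remotesite.com/61224.html', [CURLOPT_TIMEOUT => 5]);

$curl = curl_init();
curl_setopt_array($curl, $options);
// curl_exec($curl) would now send the request with these settings.
```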
The other difference is speed: cURL is often reported to be faster than file_get_contents(), which file_get_html() uses internally, although the gap depends on the setup.
So, what do you think is better? cURL or file_get_html?