PHP Script to Extract URLs from Webpage

Updated on November 6, 2017

Are you looking for a PHP script to extract URLs from a webpage? This tutorial provides a code snippet that extracts all URLs/links from a given web page.

Step 1: Create a variable to store the source URL.

$sourceURL="http://example.com";
Note:

Replace example.com with the URL you wish to extract links from.

Step 2: Read the source using the file_get_contents() function

$content=file_get_contents($sourceURL);

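The file_get_contents() function reads the source of the given URL into a string and returns FALSE if the read fails. Reading from a URL requires the allow_url_fopen setting to be enabled in php.ini. When reading local files, the function can use memory-mapping techniques to improve performance (if supported by the operating system).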
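Because file_get_contents() returns FALSE on failure, it may help to guard the call. The following is a minimal sketch, not part of the original script: the helper name readSource() and the 10-second timeout are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: returns the source as a string, or null if the read fails.
function readSource(string $url, int $timeoutSeconds = 10): ?string
{
    // The "http" context options apply to both http:// and https:// URLs.
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);

    // @ suppresses the warning emitted on failure; the return value is checked instead.
    $content = @file_get_contents($url, false, $context);

    return $content === false ? null : $content;
}

// Usage (assumes $sourceURL from Step 1):
// $content = readSource($sourceURL);
// if ($content === null) { echo "Could not read the source URL.\n"; }
```

Returning null keeps the failure case easy to tell apart from an empty page, which would come back as an empty string.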

You may print or echo the $content variable to verify that the source was read properly.

$content=file_get_contents($sourceURL);
echo $content;

The HTML of the given URL should be displayed. Now that we have the HTML, let’s extract the anchor tags.

Alternatively, cURL can also be used to retrieve the source of a given URL.
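As a rough sketch of the cURL route, assuming the PHP cURL extension is installed: the helper name fetchSource() and the timeout values below are illustrative assumptions, not part of the original script.

```php
<?php
// Hypothetical helper: fetches a URL with cURL and returns null on failure.
function fetchSource(string $url): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // seconds allowed for connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // seconds allowed for the whole request

    // curl_exec() returns false when the request fails.
    $content = curl_exec($ch);

    return $content === false ? null : $content;
}

// Usage (assumes $sourceURL from Step 1):
// $content = fetchSource($sourceURL);
```

The result can be passed to the following steps in place of the file_get_contents() output.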

Step 3: Strip all tags except <a>

Use strip_tags() to strip HTML and PHP tags from the string while allowing only the anchor (<a>) tag. Keeping the <a> tags lets you retrieve the links; the other tags are not needed.

$content = strip_tags($content,"<a>");

The above line stores the stripped content, with its <a> tags intact, back in the $content variable.

To understand how strip_tags() works, let’s consider an example string $text that contains a <div> tag, an HTML comment, and an <a> tag. The code below strips the <div> tags and the comment and leaves the <a> tag untouched.

<?php
$text = '<div>Example content to explain strip tags</div><!-- Comment --> <a href="#test">Linked Text</a>';
echo strip_tags($text,'<a>');
?>

Sample output:

Example content to explain strip tags <a href="#test">Linked Text</a>

So the $content variable now contains the stripped text plus the <a> tags.

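Step 4: Use preg_split() to split the string into substrings.

The preg_split() function splits a given string by a regular expression and returns a numerically indexed array of the pieces. The pattern in our case is the closing </a> tag, so each piece ends just before a closing tag and contains at most one link.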

$subString = preg_split("/<\/a>/",$content);
Note:

The forward slash in </a> is escaped (\/) because / is also the pattern delimiter.

Use print_r($subString) to view the array.
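To make the split concrete, here is a small sketch using a made-up sample string in place of a real page (the sample text and URLs are assumptions for illustration):

```php
<?php
// Sample input, already reduced to plain text plus <a> tags.
$content = 'Intro <a href="http://example.com/a">A</a> and <a href="http://example.com/b">B</a> end';

// Split at every closing </a> tag.
$subString = preg_split("/<\/a>/", $content);

print_r($subString);
```

The result has three elements with the numeric keys 0, 1, and 2. The first two each end with the opening <a> tag and its link text; the last one holds the trailing text after the final link.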

Use foreach to loop through the array, keep only the elements that contain <a href=, and print just the link from each one, as shown below.

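foreach ( $subString as $val ){
    // Only process pieces that contain a link
    if( strpos($val, "<a href=") !== FALSE ){
        // Remove everything up to and including <a href="
        $val = preg_replace("/.*<a\s+href=\"/sm","",$val);
        // Remove the closing quote and everything after it
        // (the s flag lets . match newlines in multi-line link text)
        $val = preg_replace("/\".*/s","",$val);
        print $val."\n";
    }
}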
Info:

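preg_replace() searches for a given pattern and replaces each match with the given value. Here the first call removes everything up to and including <a href=", and the second removes the closing quote and everything after it, leaving only the URL.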
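The effect of the two preg_replace() calls can be traced on a single substring. This is a sketch with a made-up sample value; the variable names $afterFirst and $afterSecond are introduced here only for illustration.

```php
<?php
// One piece as produced by preg_split() (sample data).
$val = 'Intro <a href="http://example.com/a">A';

// First call: remove everything up to and including <a href="
$afterFirst = preg_replace("/.*<a\s+href=\"/sm", "", $val);
echo $afterFirst . "\n";   // http://example.com/a">A

// Second call: remove the closing quote and everything after it.
// The s flag lets . match newlines, so multi-line link text is dropped too.
$afterSecond = preg_replace("/\".*/s", "", $afterFirst);
echo $afterSecond . "\n";  // http://example.com/a
```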

That’s it! Here’s the complete code.

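<?php
// Step 1: the page to extract links from
$sourceURL = "http://example.com";

// Step 2: read the page source into a string
$content = file_get_contents($sourceURL);

// Step 3: strip every tag except <a>
$content = strip_tags($content, "<a>");

// Step 4: split at each closing </a> and print the href of each link
$subString = preg_split("/<\/a>/", $content);
foreach ( $subString as $val ){
    if( strpos($val, "<a href=") !== FALSE ){
        $val = preg_replace("/.*<a\s+href=\"/sm", "", $val);
        // The s flag lets . match newlines in multi-line link text
        $val = preg_replace("/\".*/s", "", $val);
        print $val."\n";
    }
}
?>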
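If the links are needed as data and not as printed output, the same steps could be wrapped in a function. This is a sketch only: the function name extractLinks() and the sample HTML are assumptions, not part of the original script.

```php
<?php
// Hypothetical wrapper around the steps above; returns the links as an array.
function extractLinks(string $html): array
{
    $links = [];

    // Keep only text and <a> tags, then split at each closing </a>.
    $content = strip_tags($html, "<a>");

    foreach (preg_split("/<\/a>/", $content) as $val) {
        if (strpos($val, "<a href=") !== false) {
            // Drop everything before the URL, then everything after it.
            $val = preg_replace("/.*<a\s+href=\"/sm", "", $val);
            $val = preg_replace("/\".*/s", "", $val);
            $links[] = $val;
        }
    }

    return $links;
}

// Sample HTML for illustration.
$html = '<div><a href="http://example.com/one">One</a></div>'
      . '<p>Text <a href="/two">Two</a></p>';

print_r(extractLinks($html));
```

For the sample above, the function returns the two href values, http://example.com/one and /two. Like the original loop, it only recognizes links written as <a href="..."> with double quotes and href as the first attribute, and relative URLs are returned unchanged.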

