Are you looking for a PHP script to extract URLs from a webpage? This tutorial provides a code snippet that extracts all URLs/links from a given web page.
Step 1: Create a variable to store the source URL.
$sourceURL="http://example.com";
Step 2: Read the source using the file_get_contents() function
$content=file_get_contents($sourceURL);
The file_get_contents() function reads the source of the given URL into a string and returns false if the read fails. Reading from a URL requires the allow_url_fopen setting to be enabled in php.ini. The function can use memory-mapping techniques to improve performance (if supported by the operating system).
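Because file_get_contents() returns false when a read fails, it can help to check the result before moving on. The following is a minimal sketch rather than part of the original snippet: readSource() is a hypothetical helper name, and a temporary local file stands in for the remote page so the example runs without network access.

```php
<?php
// Hypothetical helper: returns the source as a string, or null if the read failed.
function readSource(string $source): ?string
{
    // @ suppresses the warning PHP raises on failure; the return value is checked instead.
    $content = @file_get_contents($source);
    return $content === false ? null : $content;
}

// Assumption: a temporary file stands in for the page behind $sourceURL.
$sourceURL = tempnam(sys_get_temp_dir(), 'src');
file_put_contents($sourceURL, '<a href="http://example.com/">Example</a>');

$content = readSource($sourceURL);
if ($content === null) {
    echo "Could not read the source\n";
} else {
    echo $content . "\n";
}

// Remove the temporary file again.
unlink($sourceURL);
```

With a real page, $sourceURL would simply hold the URL from Step 1.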
You may print or echo the $content variable to verify that the source was read properly.
$content=file_get_contents($sourceURL); echo $content;
The HTML output of the given URL should be displayed. Now that we have the HTML output, let’s read the anchor tags.
Alternatively, cURL can also be used to retrieve the source of a given URL.
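A cURL-based fetch could look roughly like the sketch below. It assumes the PHP cURL extension is installed; fetchWithCurl() is a hypothetical helper name, and the option values (following redirects, a 10-second timeout) are illustrative choices rather than requirements.

```php
<?php
// Hypothetical helper: returns the response body as a string, or null on failure.
function fetchWithCurl(string $url): ?string
{
    // Assumption: the cURL extension may be missing, so bail out gracefully.
    if (!function_exists('curl_init')) {
        return null;
    }

    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }

    // Return the body as a string instead of printing it directly.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Follow HTTP redirects (illustrative choice).
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Give up after 10 seconds (illustrative choice).
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $content = curl_exec($ch);

    return $content === false ? null : $content;
}
```

Calling fetchWithCurl($sourceURL) would then take the place of file_get_contents($sourceURL) in Step 2.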
Step 3: Strip all tags except <a>
Use strip_tags() to strip HTML and PHP tags from a string while allowing only the anchor (<a>) tag. Keeping the <a> tags lets you retrieve the links; the other tags are not needed.
$content = strip_tags($content,"<a>");
The above line stores the stripped contents, along with the <a> tags, in the $content variable.
To understand how strip_tags() works, consider an example string $text that contains a <div> tag, an HTML comment, and an <a> tag. The code below strips the <div> tags and the comment and leaves the <a> tag untouched.
<?php
$text = '<div>Example content to explain strip tags</div><!-- Comment --> <a href="#test">Linked Text</a>';
echo strip_tags($text, '<a>');
?>
Sample output:
Example content to explain strip tags <a href="#test">Linked Text</a>
So the $content variable now contains the stripped contents plus the <a> tags.
Step 4: Use preg_split() to split a string into substrings.
The preg_split() function splits a given string by a regular expression and returns an indexed array of substrings. The pattern in our case is the closing anchor tag, </a>, so each substring should contain at most one link.
$subString = preg_split("/<\/a>/",$content);
Use print_r($subString) to view the resulting array.
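To make the split concrete, here is a small sketch run on a made-up string instead of a fetched page. The sample markup is an assumption chosen for illustration; the pattern is the one used in this step.

```php
<?php
// Assumption: a short sample string stands in for the stripped page content.
$content = 'Intro <a href="http://example.com/one">One</a> middle <a href="/two">Two</a> end';

// Split on every closing anchor tag.
$subString = preg_split("/<\/a>/", $content);

// The result is a numerically indexed array with three elements:
//   [0] => Intro <a href="http://example.com/one">One
//   [1] =>  middle <a href="/two">Two
//   [2] =>  end
print_r($subString);
```

Each element except the last still carries the opening <a href="..."> part of one link, which is what the next step relies on.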
Step 5: Loop through the array and print links
Use foreach to loop through the array, look for elements that contain an opening <a> tag, and print only the links, as shown below.
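foreach ($subString as $val) {
    // Only fragments that contain an opening anchor tag hold a link.
    if (strpos($val, "<a href=") !== FALSE) {
        // Drop everything up to and including the opening href=" of the anchor.
        $val = preg_replace("/.*<a\s+href=\"/sm", "", $val);
        // Drop the closing quote and everything after it; the s modifier also covers link text that spans several lines.
        $val = preg_replace("/\".*/s", "", $val);
        print $val."\n";
    }
}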
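The two replacements in the loop can be traced on a single fragment. This is only a sketch: the fragment is a made-up example of one element of $subString, and the second pattern here carries the s modifier so that link text spanning several lines is removed as well.

```php
<?php
// Assumption: one made-up fragment, as it might appear in $subString.
$val = "Read the <a href=\"http://example.com/docs\" title=\"Docs\">the\ndocumentation";

// Step 1: remove everything up to and including <a href="
// Leaves: http://example.com/docs" title="Docs">the\ndocumentation
$val = preg_replace("/.*<a\s+href=\"/sm", "", $val);

// Step 2: remove the first remaining double quote and everything after it.
// Leaves: http://example.com/docs
$val = preg_replace("/\".*/s", "", $val);

print $val . "\n";
```

Note that this approach only recognises links written as <a href="..."> with double quotes and with href as the first attribute.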
That’s it! Here’s the complete code.
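<?php
$sourceURL = "http://example.com";

// Read the page source into a string.
$content = file_get_contents($sourceURL);

// Keep only the anchor tags.
$content = strip_tags($content, "<a>");

// Split on closing anchor tags so each fragment holds at most one link.
$subString = preg_split("/<\/a>/", $content);

foreach ($subString as $val) {
    if (strpos($val, "<a href=") !== FALSE) {
        // Remove everything before the URL, then everything after it.
        $val = preg_replace("/.*<a\s+href=\"/sm", "", $val);
        $val = preg_replace("/\".*/s", "", $val);
        print $val."\n";
    }
}
?>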
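The regular-expression approach above expects links in the exact form <a href="...">. As an alternative, and not the method this tutorial uses, the same links could be collected with PHP's DOMDocument class, which also handles single-quoted attributes and anchors where href is not the first attribute. The sketch below assumes the DOM extension is available; extractLinks() is a hypothetical helper name and the sample markup is made up.

```php
<?php
// Hypothetical helper: returns the href value of every <a> element in $html.
function extractLinks(string $html): array
{
    $links = [];

    // loadHTML() rejects an empty string, so return early in that case.
    if (trim($html) === '') {
        return $links;
    }

    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // getAttribute() returns an empty string when the attribute is missing.
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }

    return $links;
}

// Assumption: made-up sample markup instead of a fetched page.
$sample = '<div><a class="x" href=\'/single\'>A</a> <a href="http://example.com/two">B</a> <a name="top">C</a></div>';

foreach (extractLinks($sample) as $link) {
    print $link . "\n";
}
```

With a fetched page, the $content string from Step 2 would be passed to extractLinks() before any tags are stripped.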