Articlebase.com scraping tutorial – part 1, getting links under category

Recently I have worked on several web scraping projects, and I thought I would write up my tips so that others can make use of them. I am also writing a library for grabbing content from a few popular article sites such as www.articlesnatch.com, www.articlebase.com and www.ezinearticles.com.

Initially I used Simple HTML DOM for traversing the HTML. It is easy and pleasant to use, but it is a memory hog. It would sometimes even fail to run within the 256MB of RAM allocated to PHP, especially when the traversal runs inside a loop. So I dropped it entirely and switched to PHP’s DOMDocument.
In my projects I use cURL for fetching content from remote URLs, but here I will keep things simple and use the file_get_contents() function.
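
For readers who prefer cURL, here is a minimal sketch of what such a fetch could look like. The helper name fetch_html(), the timeout default and the user agent string are my own assumptions for illustration, not part of any library.

```php
<?php
// Hypothetical helper: fetch a page with cURL instead of file_get_contents().
// Returns the response body as a string, or false on any failure.
function fetch_html($url, $timeout = 15)
{
    $ch = curl_init($url);
    if ($ch === false) {
        return false;
    }

    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,     // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeout, // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout, // seconds allowed for the whole request
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example-scraper)', // assumed value
    ));

    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Treat transport errors and HTTP error statuses alike as failure.
    if ($body === false || $status >= 400) {
        return false;
    }
    return $body;
}
```

Because it returns a string or false, a helper like this can stand in wherever file_get_contents($url) is used in the rest of this post.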

Getting Articles’ Links under any Category

A category page on the site lists a number of links to articles, each with a few lines of excerpt. We will fetch the links only.

First of all, retrieve the contents from the remote URL:

//prepare URL

$category = 'Marketing';

$page = 1;

$url = "http://www.articlebase.com/".strtolower($category)."-articles/$page/";
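
The URL pattern above can also be kept in a small helper. This is only a sketch: build_category_url() is a hypothetical name, and how the site spells multi-word categories in its URLs is something I have not verified, so the rawurlencode() call is just a safe default.

```php
<?php
// Hypothetical helper that builds the category page URL used in this post.
function build_category_url($category, $page = 1)
{
    // Clamp the page number to at least 1.
    $page = max(1, intval($page));
    // Lowercase the name and escape characters that are unsafe in a URL path.
    $slug = rawurlencode(strtolower(trim($category)));
    return "http://www.articlebase.com/{$slug}-articles/{$page}/";
}

echo build_category_url('Marketing', 2), "\n";
// prints http://www.articlebase.com/marketing-articles/2/
```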

Fetch contents

$html = file_get_contents($url);

Now, initialize our objects.

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
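
The @ operator hides the warnings that loadHTML() raises on sloppy markup, but it hides every other error on that line too. A possible alternative is to let libxml buffer its own errors. The sketch below assumes a hypothetical wrapper called load_html_quietly().

```php
<?php
// Parse HTML while buffering libxml warnings instead of silencing them with @.
function load_html_quietly($html)
{
    // loadHTML() rejects empty input, so bail out early.
    if (!is_string($html) || $html === '') {
        return false;
    }
    $previous = libxml_use_internal_errors(true); // buffer parse errors internally
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();                        // discard the buffered warnings
    libxml_use_internal_errors($previous);        // restore the earlier setting
    return $loaded ? $dom : false;
}

// Deliberately malformed sample markup.
$dom = load_html_quietly('<div><p>Unclosed paragraph</div>');
$xpath = new DOMXPath($dom);
echo $xpath->query('//p')->length, "\n"; // prints 1
```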
Query through our DOM.
$elements =  $xpath->query("//div[@class='article_row']//h3/a");
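
To see what this query matches, here is a small offline sketch. The sample markup is made up for illustration; the real category page may be structured differently. It also shows that @class='article_row' only matches when the class attribute is exactly that string, and a common XPath idiom for matching one class among several.

```php
<?php
// Assumed sample markup, not copied from the real site.
$html = <<<HTML
<html><body>
<div class="article_row"><h3><a href="http://example.com/a1">First</a></h3><p>Excerpt</p></div>
<div class="article_row featured"><h3><a href="http://example.com/a2">Second</a></h3></div>
<div class="sidebar"><h3><a href="http://example.com/ad">Ad</a></h3></div>
</body></html>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact match: the class attribute must be exactly "article_row".
$exact = $xpath->query("//div[@class='article_row']//h3/a");

// Token match: "article_row" may be one of several space-separated classes.
$token = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' article_row ')]//h3/a"
);

echo $exact->length, "\n"; // prints 1
echo $token->length, "\n"; // prints 2
```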

Make an empty array and fill it with the links

$links = array();

// query() returns false when the XPath expression is invalid
if ($elements !== false) {
    foreach ($elements as $element) {
        $links[] = $element->getAttribute('href');
    }
} else {
    return false;
}

Remove duplicate links from the array

$links = array_unique($links);
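
One thing to watch with array_unique(): it returns a new array rather than changing the one passed in, and it keeps the original keys, so the result can have gaps in its numbering. A quick sketch with made-up links:

```php
<?php
$links = array('/a.html', '/b.html', '/a.html', '/c.html');

array_unique($links);            // return value discarded: $links is unchanged
$unchanged = count($links);      // still 4

$unique = array_unique($links);  // keeps first occurrences, with keys 0, 1, 3
$unique = array_values($unique); // reindex to 0, 1, 2

print_r($unique);
```

The array_values() step is optional. It only matters if later code relies on consecutive numeric keys.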

That’s it! We now have an array with links. I have wrapped everything in a function for convenience.

function get_category_links($category, $page = 1) {
    // Build the category page URL
    $page = intval($page);
    $url = "http://www.articlebase.com/".strtolower($category)."-articles/$page/";

    // Fetch the page; give up if the request fails
    $html = file_get_contents($url);
    if (!$html) return false;

    // Parse the HTML (@ silences warnings about malformed markup)
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $elements = $xpath->query("//div[@class='article_row']//h3/a");

    $links = array();

    // query() returns false when the XPath expression is invalid
    if ($elements !== false) {
        foreach ($elements as $element) {
            $links[] = $element->getAttribute('href');
        }
    } else {
        return false;
    }

    // array_unique() returns a new array, so return its result
    return array_unique($links);
}
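
Since the function takes a page number, collecting links from several pages is a simple loop. Below is a rough sketch. collect_category_links() is a hypothetical helper of mine, and it takes the fetching function as a parameter so that it can be tried offline with a stub in place of get_category_links().

```php
<?php
// Hypothetical helper: gather links from pages 1..$maxPages.
// $fetchPage is any callable taking ($category, $page) and returning an
// array of links or false, for example 'get_category_links'.
function collect_category_links($fetchPage, $category, $maxPages)
{
    $all = array();
    for ($page = 1; $page <= $maxPages; $page++) {
        $links = call_user_func($fetchPage, $category, $page);
        if ($links === false || count($links) === 0) {
            break; // fetch failed or no more articles
        }
        $all = array_merge($all, $links);
    }
    // The same article may appear on more than one page.
    return array_values(array_unique($all));
}

// Stub standing in for get_category_links() so the sketch runs offline.
$stub = function ($category, $page) {
    $pages = array(
        1 => array('/a.html', '/b.html'),
        2 => array('/b.html', '/c.html'),
    );
    return isset($pages[$page]) ? $pages[$page] : array();
};

$result = collect_category_links($stub, 'Marketing', 5);
print_r($result);
```

With the real function the call would look like collect_category_links('get_category_links', 'Marketing', 3).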