Recently I have worked on several web scraping projects. I thought I would write up my tips so that others can make use of them. I am also writing a library for grabbing content from a few popular article sites such as www.articlesnatch.com, www.articlebase.com and www.ezinearticles.com.
Initially I used Simple HTML DOM for traversing the HTML. It is easy and nice, but the script is a memory hog. Sometimes it would even fail to work with 256MB of RAM allocated to PHP, especially when the traversal runs several times in a loop. So I dropped it entirely and switched to PHP's DOMDocument.
In my projects I have used cURL to get content from remote URLs, but here I will use the simpler file_get_contents() function.
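For readers who prefer cURL, here is a minimal sketch of what such a fetch could look like. The function name fetch_url(), the timeout values and the user agent string are my own placeholders for illustration, not part of the library mentioned above.

```php
<?php
// Hypothetical helper: fetch a URL with cURL.
// Returns the response body as a string, or false on any failure.
function fetch_url($url, $timeout = 15) {
    $ch = curl_init($url);
    if ($ch === false) {
        return false;
    }
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,     // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 5,        // seconds to wait for the connection
        CURLOPT_TIMEOUT        => $timeout, // seconds allowed for the whole request
        CURLOPT_USERAGENT      => 'ExampleScraper/1.0', // placeholder value
    ));
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // Treat transport errors and HTTP 4xx/5xx responses as failures.
    if ($body === false || $status >= 400) {
        return false;
    }
    return $body;
}
```

It can stand in for file_get_contents() wherever a page is fetched, for example $html = fetch_url($url);. The explicit timeouts help when a script requests many pages in a loop.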
Getting Articles’ Links under any Category
A category page on the article site lists a number of links to articles, each with a few lines of excerpt. We will fetch only the links.
First of all, retrieve the contents from the remote URL:
// Prepare the URL
$category = 'Marketing';
$page = 1;
$url = "http://www.articlebase.com/".strtolower($category)."-articles/$page/";
Fetch contents
$html = file_get_contents($url);
Now, initialize our objects.
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
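The @ operator above hides the warnings that loadHTML() raises on imperfect markup. An alternative, sketched here with a made-up HTML string standing in for the downloaded page, is to let libxml collect those warnings so they can be inspected or discarded explicitly.

```php
<?php
// Sample markup standing in for a downloaded page (an assumption for illustration).
$html = '<html><body>'
      . '<div class="article_row"><h3><a href="/a.html">A</a></h3></div>'
      . '</body></html>';

// Collect parse warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom    = new DOMDocument();
$loaded = $dom->loadHTML($html);

$errors = libxml_get_errors();          // array of LibXMLError objects, possibly empty
libxml_clear_errors();                  // free the collected errors
libxml_use_internal_errors($previous);  // restore the previous setting

$xpath = new DOMXPath($dom);
```

This keeps @ out of the code, so unrelated errors on that line are no longer silenced along with the parser warnings.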
Query through our DOM.
$elements = $xpath->query("//div[@class='article_row']//h3/a");
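One thing to keep in mind: @class='article_row' matches only when the class attribute is exactly that string. If the site ever adds a second class to the div, the query stops matching. The sketch below uses made-up markup to compare the exact match with a more tolerant expression; the tolerant form is a common XPath idiom, not something the original query needs as long as the markup stays as it is.

```php
<?php
// Made-up markup: the first row carries an extra class, the second does not.
$html = '<div class="article_row featured"><h3><a href="/one.html">One</a></h3></div>'
      . '<div class="article_row"><h3><a href="/two.html">Two</a></h3></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Exact match: the class attribute must equal 'article_row'.
$exact = $xpath->query("//div[@class='article_row']//h3/a");

// Tolerant match: 'article_row' may be one of several space-separated classes.
$tolerant = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' article_row ')]//h3/a"
);

echo $exact->length, "\n";    // 1
echo $tolerant->length, "\n"; // 2
```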
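Make an empty array to hold the links, then collect the href of each matched element:
$links = array();
// DOMXPath::query() returns false (not null) when the expression is malformed
if ($elements !== false) {
    foreach ($elements as $element) {
        $links[] = $element->getAttribute('href');
    }
} else {
    return false;
}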
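Remove duplicate links from the array. Note that array_unique() returns a new array rather than changing the one passed in, so the result has to be assigned back:
$links = array_unique($links);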
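To see why the return value of array_unique() matters, here is a small standalone sketch with made-up links. array_values() is added only to renumber the keys, since array_unique() keeps the original ones.

```php
<?php
$links = array('/a.html', '/b.html', '/a.html');

// The return value is discarded here, so $links is left untouched.
array_unique($links);
$count_before = count($links); // still 3

// Assign the result back, and renumber the keys from 0.
$links = array_values(array_unique($links));
$count_after = count($links);  // 2
```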
That’s it! We now have an array of links. I have wrapped everything in a function for convenience.
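function get_category_links($category, $page = 1) {
    // Build the category URL for the requested page
    $page = intval($page);
    $url = "http://www.articlebase.com/".strtolower($category)."-articles/$page/";

    // Fetch the page; give up if nothing came back
    $html = file_get_contents($url);
    if (!$html) return false;

    // Parse the HTML, suppressing warnings about imperfect markup
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // query() returns false (not null) when the expression is malformed
    $elements = $xpath->query("//div[@class='article_row']//h3/a");
    if ($elements === false) return false;

    $links = array();
    foreach ($elements as $element) {
        $links[] = $element->getAttribute('href');
    }

    // array_unique() returns a new array, so return its result
    return array_values(array_unique($links));
}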
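Since a category usually spans several pages, the function can be called in a loop. Below is a rough sketch of that; collect_links() and its $fetch_page parameter are names I made up for illustration. The page fetcher is passed in as a callable so the loop can be tried with a stub before pointing it at the live site.

```php
<?php
// Hypothetical helper: walk pages 1..$max_pages of a category and merge the links.
// $fetch_page is any callable taking ($category, $page) and returning an array
// of links, or false on failure (for example 'get_category_links').
function collect_links($category, $max_pages, $fetch_page) {
    $all = array();
    for ($page = 1; $page <= $max_pages; $page++) {
        $links = call_user_func($fetch_page, $category, $page);
        // Stop at the first failed or empty page.
        if ($links === false || count($links) === 0) {
            break;
        }
        $all = array_merge($all, $links);
    }
    // Drop duplicates across pages and renumber the keys.
    return array_values(array_unique($all));
}
```

With the function above, the call would look like collect_links('Marketing', 3, 'get_category_links'). A short sleep() between pages is worth considering to go easy on the remote server.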
