Articlesnatch.com scraping tutorial, getting full article

In my last three tutorials I discussed how to scrape content from www.articlebase.com. In this part, I will show how to scrape content from www.articlesnatch.com. However, unlike the previous tutorials, I will not use DOMDocument here, and I will not use regular expressions either.

I will show how to get the full article. I won't show how to get article links under a category, because articlesnatch.com offers a feed for each category, so it is easy to get the article summaries and links for any category. Since the feed does not include the full text, I will only show how to fetch that.
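
As a quick illustration (this is not from the original post, the feed URL below is only a placeholder, and it assumes a standard RSS layout), reading such a category feed takes just a few lines with SimpleXML:

//placeholder: substitute the actual articlesnatch.com category feed URL
$feed_url = 'articlesnatch_category_feed_url';

$feed = simplexml_load_file($feed_url);

foreach ($feed->channel->item as $item) {
    //each item carries the article title, link and a short summary
    echo $item->title . ' => ' . $item->link . "\n";
}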

Getting the Article Body

//$link holds the URL of the articlesnatch.com article page
$html = file_get_contents($link);

We need the content that sits inside the div with the class “KonaBody”. That means our target content is within:

<div class="KonaBody">

......

......

</div>

So, we may remove anything before this div.

$desc = strstr($html,'<div class="KonaBody">');
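
As a rough sketch only (it naively assumes no nested <div> appears before the closing tag, which may not hold for real pages), the cut could then be finished like this:

//naive sketch: cut at the first closing </div> after the strstr() call above
$end = strpos($desc, '</div>');

if ($end !== false) {
    $desc = substr($desc, 0, $end);
}

//drop the opening tag itself, leaving only the article body markup
$desc = str_replace('<div class="KonaBody">', '', $desc);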

Continue reading →

Articlebase.com scraping tutorial – part 3, getting full article

In the first part, I showed how to get links under any category. In the second part, I showed how to get links for any search term. In this part, I will show how to fetch the full content of an article.

Let's get the HTML.

//placeholder: replace with a real articlebase.com article URL
$link = 'articlebase_article_link';

$html = file_get_contents($link);

Now, create the DOM objects.

$dom = new DOMDocument();
@$dom->loadHTML($html); //the @ suppresses warnings caused by malformed HTML
$xpath = new DOMXPath($dom);
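
As an illustration of where this is heading, a query of the following shape would pull out the wrapper node; note that the class name 'article-body' is only a stand-in, not articlebase.com's real markup:

//illustrative only: 'article-body' is a placeholder class name
$nodes = $xpath->query('//div[@class="article-body"]');

if ($nodes->length > 0) {
    //saveHTML() on a single node returns just that node's markup
    $body = $dom->saveHTML($nodes->item(0));
}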

Continue reading →

Articlebase.com scraping tutorial – part 1, getting links under category

Recently I have worked on several web scraping projects. I thought I would write up my tips so that they may be of use to others. I am also writing a library for grabbing content from a few popular article resources like www.articlesnatch.com, www.articlebase.com and www.ezinearticles.com.

Initially I used Simple HTML DOM for traversing the HTML. It is easy and nice, but the script is a memory hog. It would sometimes fail to run even within 256MB of RAM allocated to PHP, especially when such traversal runs over a few (loop) cycles. So I dropped it entirely and used PHP's DOMDocument instead.
In my projects I use cURL for fetching content from remote URLs, but here I will demonstrate with the simpler file_get_contents() function.
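
For reference, a minimal cURL-based fetch, sketched here only as an alternative (your options and timeouts may differ), looks like this:

function fetch_url($url)
{
    $ch = curl_init($url);

    //return the response as a string rather than printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    //follow redirects and give up after 30 seconds
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    $html = curl_exec($ch);
    curl_close($ch);

    return $html;
}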

Getting Articles’ Links under any Category

A category page lists a number of links to articles, each with a few lines of excerpt. We will fetch the links only.

First of all, retrieve the contents from the remote URL:

//prepare URL

$category = 'Marketing';

$page = 1;

$url = "http://www.articlebase.com/".strtolower($category)."-articles/$page/";
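
As a sketch of the general idea only (the XPath expression below is a guess, not articlebase.com's actual markup), the link extraction would run along these lines:

$html = file_get_contents($url);

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

//guessed selector: adjust the class name to the category page's actual markup
$anchors = $xpath->query('//div[@class="article-title"]/a');

$links = array();
foreach ($anchors as $a) {
    $links[] = $a->getAttribute('href');
}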

Continue reading →