Scraping tutorial: getting the full article

In my last three tutorials I discussed how to scrape content from a website. In this part, I will show how to scrape content from another site. However, unlike the previous tutorials, I will not use DOMDocument in this part, and I will not use regular expressions either.

I will show how to get the full article. I won’t show how to get the articles or links under a category, because the site offers a feed for each category, so it is easy to get the article summaries and links for any category. As the feed does not include the full text, I will just show how to get it.

Getting Article Body

$html = file_get_contents($link);
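file_get_contents returns false when the page cannot be read, so it can help to check the result before going further. Here is a minimal sketch of that check. The helper name fetch_html and the 10 second timeout are my own assumptions, not part of the steps in this tutorial.

```php
<?php
// Hypothetical helper: fetch a page and return null on failure.
// The name fetch_html and the 10 second timeout are illustrative choices.
function fetch_html(string $link): ?string
{
    // A stream context lets us set a timeout for HTTP requests.
    $context = stream_context_create([
        'http' => ['timeout' => 10],
    ]);

    // @ hides the PHP warning; we check the return value instead.
    $html = @file_get_contents($link, false, $context);

    return $html === false ? null : $html;
}

// Usage (with $link holding the article URL):
// $html = fetch_html($link);
// if ($html === null) { die("Could not fetch $link"); }
```

If the function returns null, there is nothing to parse and the remaining steps can be skipped.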

We need the content inside the div with the class “KonaBody”. That means our target content is within:

<div class="KonaBody">




So we can remove everything before this div.

$desc = strstr($html,'<div class="KonaBody">');

We don’t need anything after this div either. Let’s get the position of the closing </div>:

$end = stripos($desc,'</div>');

Now extract everything before that position:

$desc = substr($desc,0,$end);

If you inspect the HTML in $desc, you will see that the closing </div> tag is gone but the opening tag is still there. So we need to remove the opening tag of this div too.

$desc = str_ireplace( '<div class="KonaBody">','',$desc);
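The four steps above can be wrapped into one function. This is only a sketch: the function name extract_div and the sample markup are made up for illustration. It also returns null when the opening or closing tag is not found, because strstr and stripos return false in that case.

```php
<?php
// Hypothetical helper combining the strstr / stripos / substr / str_ireplace steps.
function extract_div(string $html, string $openTag, string $closeTag = '</div>'): ?string
{
    // Drop everything before the opening tag; strstr returns false if it is absent.
    $desc = strstr($html, $openTag);
    if ($desc === false) {
        return null;
    }

    // Find the first closing tag after the opening tag.
    $end = stripos($desc, $closeTag);
    if ($end === false) {
        return null;
    }

    // Keep the text before the closing tag, then remove the opening tag itself.
    $desc = substr($desc, 0, $end);
    return str_ireplace($openTag, '', $desc);
}

// Made-up sample markup, for illustration only.
$sample = '<div id="top">menu</div><div class="KonaBody"><p>Hello</p></div><div>footer</div>';
$desc = extract_div($sample, '<div class="KonaBody">');
echo $desc, "\n"; // prints <p>Hello</p>
```

Keep in mind that this stops at the first </div> it finds. If the article body contains a nested div, the content will be cut short at that point.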

Now we have the content that was within <div class="KonaBody"></div>. This is the main article, and you can process $desc further as you want: remove HTML tags, add extra tags, or replace tags. To remove all HTML tags except anchors,

$desc = strip_tags($desc,'<a>');
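To see what the second argument of strip_tags does, here is a small example on a made-up string (in the tutorial, $desc comes from the scraped page).

```php
<?php
// Made-up sample; in the tutorial $desc holds the scraped article body.
$desc = '<p>Read <a href="/more">more</a> about <b>scraping</b>.</p>';

// Keep only anchor tags; every other tag is removed, its text stays.
$linksOnly = strip_tags($desc, '<a>');
echo $linksOnly, "\n"; // prints Read <a href="/more">more</a> about scraping.

// Without the second argument, all tags are removed.
$plain = strip_tags($desc);
echo $plain, "\n"; // prints Read more about scraping.
```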

Getting Author

Fetch the author paragraph:

$author = strstr($html,'<p><b>About the Author:</b>');
$end = stripos($author,'</p>');
$author = substr($author,0,$end);

Remove the “Distributed by” text:

$pos = stripos($author,'Distributed by');
if($pos !== false){
    $author = substr($author,0,$pos);
}

You could also remove that text with the str_replace function.
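The author steps can also be put into one function. This is a sketch under my own assumptions: the name extract_author and the sample markup are made up, and removing the “About the Author:” label with str_replace and the remaining tags with strip_tags is an extra cleanup step that goes beyond the code above.

```php
<?php
// Hypothetical helper wrapping the author steps.
function extract_author(string $html): ?string
{
    // Drop everything before the author paragraph.
    $author = strstr($html, '<p><b>About the Author:</b>');
    if ($author === false) {
        return null;
    }

    // Cut at the end of the paragraph.
    $end = stripos($author, '</p>');
    if ($end !== false) {
        $author = substr($author, 0, $end);
    }

    // Cut at the "Distributed by" text, if present.
    $pos = stripos($author, 'Distributed by');
    if ($pos !== false) {
        $author = substr($author, 0, $pos);
    }

    // Extra cleanup (my addition): remove the label and any leftover tags.
    $author = str_replace('About the Author:', '', $author);
    return trim(strip_tags($author));
}

// Made-up sample markup, for illustration only.
$sample = '<p>Body</p><p><b>About the Author:</b> Jane Doe writes about PHP. Distributed by Example Syndicate</p>';
$author = extract_author($sample);
echo $author, "\n"; // prints Jane Doe writes about PHP.
```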

That’s it. You now have the full article body in the $desc variable and the author details in the $author variable.

You may get a more portable result if you do the same with regular expressions. However, this can be a good start for newbies.
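For comparison, here is a rough sketch of the regular expression approach using preg_match. The function name extract_div_regex, the pattern, and the sample markup are my own assumptions. The pattern tolerates extra attributes after the class and line breaks inside the div, which the string functions do not.

```php
<?php
// Hypothetical regex version of the article body extraction.
function extract_div_regex(string $html): ?string
{
    // i = case-insensitive, s = let the dot match newlines.
    // (.*?) is non-greedy, so it stops at the first </div>.
    $pattern = '#<div\s+class="KonaBody"[^>]*>(.*?)</div>#is';

    if (preg_match($pattern, $html, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

// Made-up sample markup with an extra attribute and line breaks.
$sample = "<div class=\"KonaBody\" id=\"x\">\n<p>Hello</p>\n</div><div>footer</div>";
$body = extract_div_regex($sample);
echo trim($body), "\n"; // prints <p>Hello</p>
```

Like the string version, it still ends at the first </div>, so nested divs inside the article would need a different approach.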

Have fun!