Articlebase.com scraping tutorial – part 3, getting the full article

In the first part, I showed how to get links under any category. In the second part, I showed how to get links for any search term. In this part, I will show how to fetch the full content of an article.

Let's get the HTML.

$link = 'article_base_article_link';

$html = file_get_contents($link);
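file_get_contents() returns false when the request fails, so it is worth checking the result before parsing. The following is a minimal sketch, not part of the original script: the helper name fetch_html(), the timeout and the User-Agent string are all assumptions made for illustration.

```php
<?php
// Hypothetical helper (not from the tutorial): fetch a page and report failure as false.
function fetch_html($url, $timeout = 10)
{
    // The 'http' options only apply to http:// and https:// URLs.
    $context = stream_context_create(array(
        'http' => array(
            'timeout'    => $timeout,
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)', // assumed value
        ),
    ));

    // @ hides the PHP warning on failure; the return value tells us what happened.
    $html = @file_get_contents($url, false, $context);

    // Treat both a failed request (false) and an empty body as failure.
    return ($html === false || $html === '') ? false : $html;
}
```

You would then call fetch_html($link) and stop early if it returns false.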

Now, create the objects.

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
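The @ in front of loadHTML() silences the warnings that real-world markup usually triggers. If you would rather collect those warnings than hide them, libxml's internal error handling is an alternative. This is only a sketch; the sample markup is made up to stand in for a downloaded page.

```php
<?php
// Sample markup standing in for a downloaded page (assumption for this demo).
$html = '<html><body><div class="article_pg"><h1>Sample title</h1></div><p>Unclosed<br></body></html>';

// Ask libxml to store parse errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Grab whatever was recorded, then reset the error buffer and the old setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

echo count($errors) . " parse message(s) recorded\n";
echo $xpath->query("//h1")->item(0)->nodeValue . "\n";
```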

Create an empty array to hold our findings.

$result = array();

First, we will get the article title.

$elements = $xpath->query("//div[@class='article_pg']//h1");
$element = $elements->item(0);
$result['title'] = strip_tags($element->nodeValue);
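If the query matches nothing (for example, because the page layout changed), item(0) returns null and reading nodeValue raises a "Trying to get property of non-object" notice. A small guard avoids that. The helper name first_node_text() and the sample markup are assumptions for illustration only.

```php
<?php
// Hypothetical helper: text of the first node matching $query, or null if nothing matches.
function first_node_text(DOMXPath $xpath, $query)
{
    $nodes = $xpath->query($query);

    // query() returns false for a malformed expression and an empty list for no match.
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->nodeValue);
}

// Made-up markup standing in for an article page.
$dom = new DOMDocument();
@$dom->loadHTML('<div class="article_pg"><h1> Sample title </h1></div>');
$xpath = new DOMXPath($dom);

$title   = first_node_text($xpath, "//div[@class='article_pg']//h1");
$missing = first_node_text($xpath, "//div[@class='no_such_class']//h1");

var_dump($title, $missing);
```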

We now have the article title. Next, we will get the article body.

$elements = $xpath->query("//div[@class='article_cnt KonaBody']");
$element = $elements->item(0);
$result['body'] = $dom->saveXML($element);
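Note that @class='article_cnt KonaBody' only matches when the class attribute is exactly that string, so the query breaks as soon as the site adds or drops a class. A more tolerant option is to match a single class token. The sketch below assumes that KonaBody is the token worth keying on; the function name find_body_html() and both sample snippets are made up.

```php
<?php
// Hypothetical helper: return the markup of the first div whose class list contains "KonaBody".
function find_body_html($html)
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // Pad the class attribute with spaces so we match whole class names only.
    $query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' KonaBody ')]";
    $nodes = $xpath->query($query);

    return $nodes->length > 0 ? $dom->saveXML($nodes->item(0)) : null;
}

// Both class layouts are matched by the same query.
$withTwoClasses = find_body_html('<div class="article_cnt KonaBody"><p>Body</p></div>');
$withOneClass   = find_body_html('<div class="KonaBody"><p>Body</p></div>');
$noMatch        = find_body_html('<div class="KonaBodyExtra"><p>Body</p></div>');
```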

You may now want to clean up the HTML, as it may contain some site-specific attributes. I am not going to show that here.
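For readers who want a starting point, here is one possible way to do that cleanup: remove every attribute except a short allow-list. This is a sketch under assumptions, not the method used in the original script; the function name strip_attributes(), the allow-list and the sample markup are all hypothetical.

```php
<?php
// Hypothetical helper: drop every attribute under $root except those named in $keep.
function strip_attributes(DOMNode $root, array $keep = array('href', 'src'))
{
    $xpath = new DOMXPath($root->ownerDocument);

    // Copy the matches into an array so removals cannot disturb the iteration.
    $attributes = iterator_to_array($xpath->query('descendant-or-self::*/@*', $root), false);

    foreach ($attributes as $attribute) {
        if (!in_array($attribute->nodeName, $keep, true)) {
            $attribute->ownerElement->removeAttributeNode($attribute);
        }
    }
}

// Made-up markup standing in for an article body.
$dom = new DOMDocument();
@$dom->loadHTML('<div class="article_cnt KonaBody" id="x"><p style="color:red">Hi <a href="http://example.com/" onclick="track()">link</a></p></div>');
$xpath = new DOMXPath($dom);
$body = $xpath->query("//div[@class='article_cnt KonaBody']")->item(0);

strip_attributes($body);
$clean = $dom->saveXML($body);
echo $clean . "\n";
```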

Next, get the article's author bio.

// get author bio
$result['author_bio'] = '';
$elements = $xpath->query("//div[@class='author_details']/p");

// append the markup of every <p> inside the author details block
for ($i = 0; $i < $elements->length; $i++) {
    $element = $elements->item($i);
    $result['author_bio'] .= $dom->saveXML($element);
}

We now have everything we need from the article. You may process it further as you need. The full code looks like this:

function get_article($url) {
    $html = file_get_contents($url);
    if (!$html) return false;

    $result = array();
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // get title
    $elements = $xpath->query("//div[@class='article_pg']//h1");
    $element = $elements->item(0);
    $result['title'] = strip_tags($element->nodeValue);

    // get body
    $elements = $xpath->query("//div[@class='article_cnt KonaBody']");
    $element = $elements->item(0);
    $result['body'] = $dom->saveXML($element);

    // get author bio
    $result['author_bio'] = '';
    $elements = $xpath->query("//div[@class='author_details']/p");

    // append the markup of every <p> inside the author details block
    for ($i = 0; $i < $elements->length; $i++) {
        $element = $elements->item($i);
        $result['author_bio'] .= $dom->saveXML($element);
    }

    return $result;
}

  • kc

    Hi Hungry Coder,

    With your help, I was able to get the getting search links in part 2 working.

    I am having a problem with the script above now though. I think I have it working up to
    the lines:

    //get title
    34 $elements = $xpath->query("//div[@class='article_pg']//h1");
    35 $element = $elements->item(0);
    36 $result['title']=strip_tags($element->nodeValue);

    but when I try to append "echo($result['title']);" to this, I get the error

    Notice: Trying to get property of non-object in C:\wamp\www\articlebase\ab3.php on line 36

  • KC

    Hi Hungry Coder,

    I have a follow up question if you don’t mind

    I have been reading up on the DOM object and the DOMXPath object. But what I am less sure of is how you figured out that the title location is //div[@class='article_pg']//h1 and the body is at //div[@class='article_cnt KonaBody']. When I look at the articlebase source HTML, it's not clear to me. Can you tell me your approach to figuring out what you wish to scrape?

    My Regards,

    KC

  • The HungryCoder

    Besides reading about the DOM, you can read about XPath at http://www.w3schools.com.

    To find an XPath easily, you can use the FireFinder add-on for Firebug! I hope you know Firebug!

    The URL for FireFinder is https://addons.mozilla.org/en-US/firefox/addon/11905/

    Thanks 🙂

  • It seems I’m pulling more than I should when using this source. Has Articlebase changed the design of the page?

  • The HungryCoder

    Yep, they have changed their CSS classes. One of my old clients also asked me to update the autoblogging script that I made for him. I will post an update soon.

  • I managed to fix it. Change every instance of
    @class='article_cnt KonaBody'
    to just
    @class='KonaBody'
    It works fine... l8r

  • The HungryCoder

    Yep, thanks. I have also got it working. I just forgot to post here!

    However, it took me a while to resolve, most probably because my client had renamed http://www.articlesbase.com to articlebase.com, which I kept overlooking! 🙁
