Articlesbase.com scraping tutorial – part 2, getting search links

In the first part, I showed how to get the article links under any category. Now we will get the links that come back when you search articlesbase.com for a keyword.

Getting HTML

$keyword = 'beauty';
$page = 1; // search results page to fetch

$page = intval($page);
$url = "http://www.articlesbase.com/find-articles.php?q=".strtolower(urlencode($keyword))."&page=".urlencode($page);

$html = file_get_contents($url);
if(!$html) return false; // stop here if the request failed
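
To check what URL the snippet above produces for a given keyword and page, a small helper can be handy. This is only a sketch: the name build_search_url is hypothetical and not part of the original code, and it simply mirrors the string building shown above.

```php
<?php
// Hypothetical helper (not in the original tutorial): builds the search URL
// the same way as the snippet above, so it can be inspected without a request.
function build_search_url($keyword, $page = 1) {
    $page = intval($page); // force the page number to an integer
    return "http://www.articlesbase.com/find-articles.php?q="
        . strtolower(urlencode($keyword)) // spaces become "+", then lowercased
        . "&page=" . $page;
}

echo build_search_url('Skin Care', 2), "\n";
// prints http://www.articlesbase.com/find-articles.php?q=skin+care&page=2
```

Printing the URL first makes it easy to paste it into a browser and confirm the search page looks as expected before scraping it.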

Initialize objects

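$dom = new DOMDocument();
@$dom->loadHTML($html); // @ hides the warnings that invalid real-world HTML triggers
$dom = new DOMXPath($dom); // from here on, $dom is the XPath object used for queries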

Query through DOM

$elements = $dom->query("//div[@class='article_row_middle']//h3/a");
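
The query can be tried offline before pointing it at the live site. The sketch below runs the same XPath expression against a made-up HTML fragment; the markup and the example.com links are assumptions for illustration only, and the real page may be structured differently.

```php
<?php
// Made-up sample markup (an assumption, not copied from the real site).
$html = '<html><body>
<div class="article_row_middle"><h3><a href="http://example.com/a1">First</a></h3></div>
<div class="article_row_middle"><h3><a href="http://example.com/a2">Second</a></h3></div>
<div class="sidebar"><h3><a href="http://example.com/ad">Ad</a></h3></div>
</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Same expression as above: links inside h3 inside the article row divs.
$nodes = $xpath->query("//div[@class='article_row_middle']//h3/a");

$found = array();
foreach ($nodes as $node) {
    $found[] = $node->getAttribute('href');
}
print_r($found); // the two article links; the sidebar link is not matched
```

Note that @class='article_row_middle' only matches when the class attribute is exactly that string, so a div carrying additional classes would be skipped.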

Make an empty array to hold links

$links = array();

Find all links and put in the array

// query() returns false (not null) when the expression is invalid
if ($elements !== false) {
    foreach ($elements as $element) {
        $links[] = $element->getAttribute('href');
    }
} else {
    return false;
}

That’s it! Simple and easy, isn’t it? Again, putting it all together:

function get_search_links($keyword, $page = 1){
    $page = intval($page);
    $url = "http://www.articlesbase.com/find-articles.php?q=".strtolower(urlencode($keyword))."&page=".urlencode($page);

    // abs_init_curl() is a separate helper that fetches the page with cURL
    $html = abs_init_curl($url);
    if(!$html) return false;

    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ hides the warnings that invalid HTML triggers
    $dom = new DOMXPath($dom);
    $elements = $dom->query("//div[@class='article_row_middle']//h3/a");

    $links = array();

    // query() returns false (not null) when the expression is invalid
    if ($elements !== false) {
        foreach ($elements as $element) {
            $links[] = $element->getAttribute('href');
        }
    } else {
        return false;
    }
    unset($element, $elements);
    return $links;
}
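
Since get_search_links() returns one page of results at a time, a natural next step is to walk through several pages. The following is a rough sketch under stated assumptions: collect_search_links is a hypothetical helper, not part of the original code, and it takes the page-fetching function as a parameter so it can be tried with a stub before making real requests.

```php
<?php
// Hypothetical helper: calls $fetch($keyword, $page) for pages 1..$max_pages
// and merges the results, stopping at the first failure or empty page.
function collect_search_links($keyword, $max_pages, $fetch) {
    $all = array();
    for ($page = 1; $page <= $max_pages; $page++) {
        $links = call_user_func($fetch, $keyword, $page);
        if ($links === false || count($links) === 0) {
            break; // request failed or no more results
        }
        $all = array_merge($all, $links);
    }
    // drop duplicates and reindex the array from 0
    return array_values(array_unique($all));
}

// With the function above (this makes real HTTP requests):
// $links = collect_search_links('beauty', 3, 'get_search_links');
```

Stopping at the first empty page avoids requesting pages past the end of the results; adding a short sleep() between requests would also be kinder to the site.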

  • kc

    Hi,

    I am a new programmer working with wampserver 2.0 and eclipse pdt.

    I tried to run your function above but am getting an error which I don’t understand

    in the line:

    $url = “http://www.articlesbase.com/find-articles.php?q=”.strtolower(urlencode($keyword)).”&page=”.urlencode($page);

    there is a marker under the ‘:’

    I don’t know what to do, since this seems fine.

    Thanks,

    KC

  • kc

    p.s the error in eclipse is:

    Parse error: parse error in C:\wamp\www\articlebase\ab1.php on line 11

    the entire file is:

    loadHTML($html);
    $dom = new DOMXPath($dom);
    $elements = $dom->query(“//div[@class=’article_row_middle’]//h3/a”);

    $links = array();

    if (!is_null($elements)) {
    foreach ($elements as $element) {
    $links[]=$element->getAttribute(‘href’);
    }
    } else {
    return false;
    }

    ?>

  • The HungryCoder

    Hello
    Thanks for taking the time to comment! Honestly, the code here is aimed at PHP programmers with at least a basic knowledge of PHP syntax. If you are not familiar with the syntax, you may find the code tough to understand and run.

    I guess you are facing a problem with the quotation marks. When code is pasted here, straight single (') and double (") quotes sometimes get converted to curly quotes (‘ ’ and “ ”), which PHP does not recognize. So first make sure the quotes in your file are straight ones. Otherwise, you can send me the file to check.

  • KC

    Thank you for your response Hungry Coder.

    I was able to get that working after making the substitutions you suggested. Now I am having a problem with the line:

    $html = abs_init_curl($url);

    I have found the abs() function (absolute value) and the init_ curl command for curl,
    but I am not sure about “abs_init_curl”. I’m sorry, this is surely a very basic question.

    My regards,

    KC

  • The HungryCoder

    hello KC,
    abs_init_curl is a separate function that makes a cURL request to articlesbase.com. If you are unsure how to make a cURL request, you can use the following code:

    function abs_init_curl($url){
        // make the cURL request to $url
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        //curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        @curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects; @ hides the warning some setups raise
        curl_setopt($ch, CURLOPT_TIMEOUT, 25); // give up after 25 seconds
        $html = curl_exec($ch);
        $info = curl_getinfo($ch);
        //print_r($info); // uncomment to dump the request details
        if (!$html) {
            echo "cURL error:" . curl_error($ch);
            echo "cURL URL:" . $info['url'];
        }
        //echo 'cURL HTTP Code: '.$info['http_code'];
        curl_close($ch);
        return $html;
    }

    BTW, the abs() function is totally different from this one. Here the abs_ prefix is my own namespace-style naming, which I always use with WordPress to avoid unexpected collisions with other functions of the same name. The prefix is usually an abbreviation of the plugin I am working on.

    Thanks

  • kc

    Thanks again Hungry Coder. That fixed it! – KC