Recently I have worked on several web scraping projects. I thought I would write up my tips so that others can make use of them. I am also writing a library for grabbing content from a few popular article sources such as www.articlesnatch.com, www.articlebase.com, and www.ezinearticles.com.
Initially I used Simple HTML DOM for traversing the HTML. It is easy and nice, but the script is a memory hog. It would sometimes even fail to work with 256MB of RAM allocated to PHP, especially when running such traversals over a few loop cycles. So I dropped it entirely and switched to PHP’s DOMDocument.
In my projects I have used cURL to get content from remote URLs. Here, however, I will use the simple function file_get_contents().
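To make the two fetching options concrete, here is a minimal sketch of each. The helper names `fetch_simple()` and `fetch_curl()`, the timeout, and the user-agent string are assumptions made for illustration; they are not taken from the library mentioned above.

```php
<?php
// Hypothetical helpers: the names, timeout, and user agent are illustrative only.

/** Fetch a URL (or local path) with file_get_contents(); returns null on failure. */
function fetch_simple(string $url, int $timeout = 15): ?string
{
    // The "http" context options only apply to http:// and https:// URLs.
    $context = stream_context_create([
        'http' => ['timeout' => $timeout, 'user_agent' => 'ExampleScraper/0.1'],
    ]);
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

/** Fetch a URL with the cURL extension; returns null on failure. */
function fetch_curl(string $url, int $timeout = 15): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_USERAGENT      => 'ExampleScraper/0.1',
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}
```

file_get_contents() needs no extra extension for plain URLs (allow_url_fopen must be enabled), while cURL gives finer control over redirects, headers, and timeouts.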
Getting Article Links under Any Category
The category page of the article site lists a number of links to articles, each with a few lines of excerpt. We will fetch only the links.
First of all, retrieve the contents from the remote URL:
$category = 'Marketing';
$page = 1;
$url = "http://www.articlebase.com/".strtolower($category)."-articles/$page/";
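One way the link-collecting step could look with DOMDocument is sketched below. The function name `extract_article_links()`, the XPath expressions, and the sample markup are all assumptions made for illustration; the real category page markup would need to be inspected before choosing a query.

```php
<?php
/**
 * Collect unique, absolute links from a chunk of category-page HTML.
 * Hypothetical helper: $xpathQuery must be adapted to the real page markup.
 */
function extract_article_links(string $html, string $baseUrl, string $xpathQuery = '//a[@href]'): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query($xpathQuery) as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // skip empty and in-page anchors
        }
        // Turn root-relative paths into absolute URLs.
        if ($href[0] === '/' && substr($href, 0, 2) !== '//') {
            $href = rtrim($baseUrl, '/') . $href;
        }
        $links[$href] = true; // array keys de-duplicate the links
    }
    return array_keys($links);
}

// Made-up sample markup standing in for a fetched category page.
$sample = '<div class="article"><h3><a href="/marketing-articles/first-post-1.html">First</a></h3></div>'
        . '<div class="article"><h3><a href="/marketing-articles/second-post-2.html">Second</a></h3></div>'
        . '<a href="#top">Top</a>';

$links = extract_article_links($sample, 'http://www.articlebase.com', '//div[@class="article"]//a[@href]');
print_r($links);
```

In real use, the `$sample` string would be replaced by the HTML fetched from `$url`, for example with file_get_contents($url).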
My first scraping work was www.stock.projanmo.com, where I fetched and processed stock data from www.dsebd.org and www.biasl.net. I had to scrape them because they did not have any syndication feed. I had to process the pages line by line, which was a tedious job.
Later, I worked on eBay product scraping for a few of my clients. In many cases it did not take much trouble, as they have web services. Even so, those were the most boring tasks, as I am not good at regular expressions. So I turned down a lot of such work.
Recently, one of my old customers asked me to work on scraping again, to collect articles from www.articlesnatch.com and auto-blog them in WordPress. That was also comparatively easy, as the site has an RSS feed for its search page. But the RSS feed had only a summary of each article, so I had to fetch the whole article.
Yesterday, I started a pretty big scraping project. I also took on helping hands to complete it fast. This time, I had to scrape articles from www.articlebase.com and auto-blog them in WordPress on preselected schedules (WordPress’s native cron). As the site doesn’t have any feed for search keywords or categories, it is a bit more complex than the previous one. However, as I have already gained some scraping experience, it was very easy for me. And most surprisingly, I am now getting interested in scraping :P.