In my last three tutorials I discussed how to scrape content from www.articlebase.com. In this part, I will show how to scrape content from www.articlesnatch.com. Unlike the previous tutorials, however, I will use neither DOMDocument nor regular expressions.
I will only show how to get the full article. I won’t cover collecting articles or links under a category, because articlesnatch.com offers a feed for each category, so it is easy to get the article summaries and links for any category. As the feed does not include the full text, I will just show how to fetch that.
Getting Article Body
$html = file_get_contents($link);
We need the content inside the div with the class “KonaBody”. That means our target content is within:
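The rest of the original walkthrough is not reproduced here, so the following is only a minimal sketch of one way to pull out that div with plain string functions, without DOMDocument or regular expressions. The helper name extractDivByClass() is hypothetical, and the sketch assumes the attribute appears literally as class="KonaBody" inside the opening div tag and that the div tags are balanced.

```php
<?php
// Hypothetical helper (not from the original tutorial): returns the inner
// HTML of the first element whose opening tag contains class="$class",
// or null when it cannot be found or the <div> tags are unbalanced.
function extractDivByClass(string $html, string $class): ?string
{
    $marker = 'class="' . $class . '"';
    $classPos = strpos($html, $marker);
    if ($classPos === false) {
        return null;
    }

    // The content starts right after the '>' that ends the opening tag.
    $start = strpos($html, '>', $classPos);
    if ($start === false) {
        return null;
    }
    $start++;

    // Walk forward, counting nested <div> ... </div> pairs.
    $depth = 1;
    $offset = $start;
    while ($depth > 0) {
        $nextOpen = stripos($html, '<div', $offset);
        $nextClose = stripos($html, '</div', $offset);
        if ($nextClose === false) {
            return null; // unbalanced markup
        }
        if ($nextOpen !== false && $nextOpen < $nextClose) {
            $depth++;
            $offset = $nextOpen + 4;
        } else {
            $depth--;
            if ($depth === 0) {
                return substr($html, $start, $nextClose - $start);
            }
            $offset = $nextClose + 5;
        }
    }
    return null;
}
```

With the page already loaded into $html as above, the call would look like $body = extractDivByClass($html, 'KonaBody');. Counting nested divs matters because the article body may itself contain div elements, so stopping at the first closing tag would cut the text short.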
PHP 5 includes a great built-in class, DOMDocument, for DOM parsing of HTML/XML documents. The class includes a number of methods for easily traversing a DOM.
However, it has a few shortcomings: it fails to handle encoding correctly, and it adds some tags to the output that developers often find irritating.
Artem Russakovskii has written an extension of this class, named SmartDOMDocument, to eliminate these shortcomings. His class inherits from the built-in DOMDocument and adds a few extra methods that can make a developer’s life easier.
Extra Features of SmartDOMDocument
DOMDocument has an extremely badly designed “feature” where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, there are no flags to turn this behavior off).
Thus, when you call $doc->saveHTML(), your newly saved content now has <html><body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem).
SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want – it saves HTML without adding that extra garbage that DOMDocument does.
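To make the wrapping behaviour concrete, here is a small sketch that uses only the built-in DOMDocument, not SmartDOMDocument. The second half relies on the libxml flags LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD, which are available in PHP 5.4 and later when built against a recent enough libxml. They are a possible alternative on newer setups and are not part of SmartDOMDocument, and their behaviour with fragments that have several top-level elements can be less predictable.

```php
<?php
$fragment = '<p>Hello <b>world</b></p>';

// Default behaviour: the fragment comes back wrapped in a full document,
// including a DOCTYPE and <html><body> tags.
$doc = new DOMDocument();
$doc->loadHTML($fragment);
$wrapped = $doc->saveHTML();

// Possible alternative on newer PHP/libxml: ask the parser not to add
// the implied wrapper elements or a default DOCTYPE.
$doc2 = new DOMDocument();
$doc2->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$bare = $doc2->saveHTML();

echo $wrapped, "\n", $bare, "\n";
```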
DOMDocument notoriously doesn’t handle encoding (at least UTF-8) correctly and garbles the output.
SmartDOMDocument tries to work around this problem by enhancing loadHTML() to deal with encoding correctly. This behavior is transparent to you – just use loadHTML() as you would normally.
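For comparison, a commonly used workaround with plain DOMDocument is to prepend an encoding hint before loading the markup. This sketch is not necessarily how SmartDOMDocument handles it internally. It assumes the input is UTF-8, and the sample string is made up for illustration.

```php
<?php
// "café" written with explicit UTF-8 bytes so the example does not depend
// on the encoding of this source file.
$html = "<p>caf\xC3\xA9</p>";

$doc = new DOMDocument();
// The XML-style encoding declaration tells the HTML parser to treat the
// input as UTF-8 instead of guessing.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Without the hint, the parser may fall back to a single-byte encoding, and multi-byte characters can come out garbled.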
Recently I have worked on several web scraping projects. I thought I would write up my tips so that others can use them. I am also writing a library for grabbing content from a few popular article directories such as www.articlesnatch.com, www.articlebase.com and www.ezinearticles.com.
Initially I used Simple HTML DOM for traversing the HTML. It is easy and nice, but the script is a memory hog. It would sometimes even fail with 256MB of RAM allocated to PHP, especially when running such traversals repeatedly in a loop. So I dropped it entirely and used PHP’s DOMDocument instead.
In my projects I have used cURL to fetch content from remote URLs. Here, though, I will use the simple function file_get_contents().
Getting Articles’ Links under any Category
The category page lists a number of links to articles, each with a few lines of excerpt. We will fetch the links only.
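The original code for this step is not included in this excerpt, so here is a minimal sketch of collecting links with DOMDocument. The helper name extractLinks() and the optional substring filter are assumptions for illustration. The real category page would need its own rule for telling article links apart from navigation links.

```php
<?php
// Hypothetical helper: returns the unique href values of all <a> elements,
// optionally keeping only those that contain $needle.
function extractLinks(string $html, string $needle = ''): array
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML; collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '' && ($needle === '' || strpos($href, $needle) !== false)) {
            $links[] = $href;
        }
    }
    return array_values(array_unique($links));
}
```

A call could look like extractLinks(file_get_contents($categoryUrl), '/articles/'), where both $categoryUrl and the '/articles/' filter are placeholders rather than values taken from the site.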
If you want to create PDF output from your HTML pages, you can do it with PHP’s PDFlib module. However, there are libraries that make development faster, and some of them do not depend on PDFlib at all. See below for a quick overview.
dompdf is an HTML to PDF converter. At its heart, dompdf is a (mostly) CSS2.1 compliant HTML layout and rendering engine written in PHP. It is a style-driven renderer: it will download and read external stylesheets, inline style tags, and the style attributes of individual HTML elements. It also supports most presentational HTML attributes.
* handles most CSS2.1 properties, including @import, @media & @page rules
* supports most presentational HTML 4.0 attributes
* supports external stylesheets, either local or through http/ftp (via fopen-wrappers)
* supports complex tables, including row & column spans, separate & collapsed border models, individual cell styling, (no nested tables yet however)
* image support (gif, png & jpeg)
* no dependencies on external PDF libraries, thanks to the R&OS PDF class
* inline PHP support. See below for details.