Python 2.7, Django 1.4
When I first started learning web development, my very first task was to scrape dozens of web sites. New language, new concepts, new tools. It took me days to complete, and along the way I learned how NOT to build a web site. For that task I used a web scraping framework called Scrapy. Since then I have come to know lxml, Beautiful Soup and HTMLParser. For any extensive web harvesting I use Scrapy, but for small tasks HTMLParser is just the thing.
HTMLParser
So, what is HTMLParser and why use it?
HTMLParser is a module in Python's standard library, so if you have Python installed, you already have it. On its own, HTMLParser does nothing: if you feed it data without customizing it, you get nothing in return. To make it tick, you subclass it and override the handler methods you need, and those overrides define what it does for you. The only thing HTMLParser itself provides is a method for parsing X/HTML formatted text, feed(); it is built in and you don't change it.
Before continuing, let's take a look at an HTML tag:
<a href="#">I am a link</a>
The first part, which comes before 'I am a link', is the start tag, and that is where all the attributes live.
'I am a link' is the data that the tag holds.
The last part, </a>, is called the end tag. Most HTML tags have one, and it holds no attributes.
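To see these three parts arrive one by one, here is a minimal sketch. The EventLogger class and its events list are hypothetical names made up for this illustration, and the import fallback is only there so the snippet runs on Python 3 as well as on the Python 2.7 used in this post.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7, as used in this post


class EventLogger(HTMLParser):
    """Hypothetical helper: records each parser callback in order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag, attrs))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


# A plain HTMLParser accepts the same input but its default handlers do nothing.
plain_result = HTMLParser().feed('<a href="#">I am a link</a>')

logger = EventLogger()
logger.feed('<a href="#">I am a link</a>')
logger.close()
for event in logger.events:
    print(event)
```

The logger ends up with three events, in order: the start tag 'a' with [('href', '#')], the data 'I am a link', and the end tag 'a'.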
HTMLParser Methods You Have to Override
handle_starttag(self, tag, attrs)
This is the method you will override most often. It receives the tag name and attrs, a list of (name, value) pairs, which makes it the place to extract attributes and their values.
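As a small sketch of attribute extraction, the hypothetical ImageSources class below collects the src and alt of every img tag. The class name and the sample markup are assumptions for illustration only. Note that the parser hands over tag and attribute names in lowercase, whatever the case in the source.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7, as used in this post


class ImageSources(HTMLParser):
    """Hypothetical example: collects (src, alt) for every <img> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        # attrs is a list of (name, value) pairs; a dict is easier to query
        attr_map = dict(attrs)
        self.images.append((attr_map.get('src'), attr_map.get('alt')))


image_parser = ImageSources()
image_parser.feed('<p><img src="a.png" alt="A"><IMG SRC="b.png"></p>')
image_parser.close()
print(image_parser.images)
```

A missing attribute simply comes back as None from attr_map.get, so the second image is stored as ('b.png', None).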
handle_endtag(self, tag)
As the name suggests, this handles end tags. It can be used to validate the HTML, for example by checking that every opened tag gets closed.
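One way such validation could look is a simple stack check, sketched below. BalanceChecker, is_balanced and the VOID_TAGS set are hypothetical names for this illustration, and the check is deliberately strict: real HTML lets some end tags (such as </p> or </li>) be omitted, which this sketch would report as unbalanced.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7, as used in this post

# Tags that never take an end tag (assumed list, not exhaustive for old HTML)
VOID_TAGS = frozenset(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                       'input', 'link', 'meta', 'source', 'track', 'wbr'])


class BalanceChecker(HTMLParser):
    """Hypothetical example: tracks open tags on a stack."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.open_tags = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # e.g. <br/> reports both a start and an end; ignore it
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append('unexpected </%s>' % tag)


def is_balanced(html):
    """Return True if every non-void tag is closed in the right order."""
    checker = BalanceChecker()
    checker.feed(html)
    checker.close()
    return not checker.errors and not checker.open_tags


print(is_balanced('<div><p>ok</p><br></div>'))
print(is_balanced('<div><p>oops</div>'))
```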
handle_data(self, data)
Use this method to extract the text held inside heading, paragraph and other tags. For example, to extract 'I am a link' from the previous example, you could write:
    def handle_data(self, data):
        print data
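handle_data on its own does not know which tag the text belongs to, so in practice it is usually paired with the start and end tag handlers. The sketch below, with the hypothetical LinkText class and made-up sample markup, keeps only the text found inside links. Text can arrive in several pieces (for instance around a nested tag), so the pieces are collected and joined.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7, as used in this post


class LinkText(HTMLParser):
    """Hypothetical example: collects the visible text of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_link = False
        self.chunks = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_link = True
            self.chunks = []

    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_link:
            self.texts.append(''.join(self.chunks).strip())
            self.in_link = False


link_text = LinkText()
link_text.feed('<p>Intro <a href="#">I am a link</a> and '
               '<a href="/x">another <b>bold</b> one</a></p>')
link_text.close()
print(link_text.texts)
```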
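__init__ method
The Python documentation doesn't say so, but I advise overriding this method and adapting it to your needs. The first parser I wrote didn't work, and adding this method solved the problem. If you do override it, remember to call HTMLParser.__init__(self) inside it, so that the parser can set up its internal state.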
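One common way a subclass can fail is an __init__ that never calls the base initializer. Whether that was the cause in the case described above is not stated, so treat the sketch below as an assumption about a typical pitfall. ForgetfulParser and CarefulParser are hypothetical names.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7, as used in this post


class ForgetfulParser(HTMLParser):
    def __init__(self):
        # The base initializer is NOT called here, so the parser's
        # internal buffer is never created.
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


class CarefulParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)  # sets up the parser's internal state
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


try:
    ForgetfulParser().feed('<p>hi</p>')
    forgetful_error = None
except AttributeError as exc:
    forgetful_error = exc  # feed() needs state the base __init__ creates

careful = CarefulParser()
careful.feed('<p>hi</p>')
careful.close()
print(forgetful_error)
print(careful.tags)
```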
Example
In my case, I needed to extract every href in a given HTML page and validate the links. Some were relative and some absolute, and my task was to check that they all worked.
This is the parser I coded into existence:
    from HTMLParser import HTMLParser

    import requests
    from django.core.urlresolvers import resolve
    from django.http import Http404


    class MyHTMLParser(HTMLParser):

        def __init__(self, fp):
            """
            fp is an input stream returned by open() or urllib2.urlopen()
            """
            HTMLParser.__init__(self)
            self.seen = {}  # holds parsed hrefs
            self.is_good = True
            self.feed(fp.read())

        def handle_starttag(self, tag, attrs):
            """
            Looking for href attributes and validating them
            """
            for k, v in attrs:
                if k == 'href' and v not in self.seen:
                    self.seen[v] = True
                    try:
                        resolve(v)
                    except Http404:
                        # Not one of our own URLs, so check it over HTTP.
                        # The flag is only ever set to False, so a later
                        # working link cannot hide an earlier broken one.
                        if not self._check_abs_url(v):
                            self.is_good = False

        def status(self):
            """
            Indicator if all links in current html are working.
            Returns True if no broken links found.
            """
            return self.is_good

        def _check_abs_url(self, url):
            """
            Checks if the link is broken
            """
            try:
                response = requests.head(url)
                # requests does not raise on 4xx/5xx responses,
                # so the status code has to be checked explicitly
                return response.status_code < 400
            except requests.exceptions.RequestException:
                return False
And that is my parser. The only methods I override are __init__ and handle_starttag. I use Django's built-in resolve function to validate relative links, and requests for absolute links. One other thing: this parser makes a lot of requests, so to go easy on both servers (the one that sends the requests and the one that responds) I use HEAD requests.
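The de-duplication and the relative/absolute split can be tried offline, without Django or any network access. The sketch below is not the parser above: it swaps Django's resolve for a plain urlparse check, which is an assumption made for illustration, and it only sorts the links instead of validating them. HrefCollector and the sample markup are hypothetical. A link counts as absolute here when it has a network location, so something like mailto: would land in the relative list.

```python
try:
    from html.parser import HTMLParser  # Python 3
    from urllib.parse import urlparse
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7, as used in this post
    from urlparse import urlparse


class HrefCollector(HTMLParser):
    """Hypothetical example: sorts unique hrefs into relative and absolute."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = set()
        self.relative = []
        self.absolute = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # skip other attributes, empty hrefs and hrefs already seen
            if name != 'href' or not value or value in self.seen:
                continue
            self.seen.add(value)
            if urlparse(value).netloc:
                self.absolute.append(value)
            else:
                self.relative.append(value)


collector = HrefCollector()
collector.feed('<a href="/about/">About</a>'
               '<a href="http://example.com/">External</a>'
               '<a href="/about/">About again</a>'
               '<link href="style.css">')
collector.close()
print(collector.relative)
print(collector.absolute)
```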