Monday, June 24, 2013

HTMLParser for small and easy tasks

Python 2.7, Django 1.4

When I first started learning web development, my very first task was to scrape dozens of web sites. New language, new concepts, new tools. It took me days to complete the task, and I learned how NOT to build a web site. For that task I used a web scraping framework called Scrapy. Since then I have come to know lxml, Beautiful Soup and HTMLParser. For any extensive web harvesting I use Scrapy, but for small tasks HTMLParser is just the thing.

HTMLParser

So, what is HTMLParser and why use it?
HTMLParser is a module in Python's standard library, so if you have Python installed, you already have it. On its own, HTMLParser does nothing: if you feed it data without overriding anything, you get nothing in return. To make it tick, you subclass it and override the handler methods you need, and those methods do the work for you. The only thing HTMLParser itself provides is the machinery that parses X/HTML formatted text through its feed() method, which is built in and not something you override.
Before continuing, let's take a look at an HTML tag:
<a href="#">I am a link</a>

The first part, <a href="#">, which comes before 'I am a link', is the start tag, and that is where all the attributes live.
'I am a link' is the data that this tag holds.
The last part, </a>, is the end tag. Most HTML tags have one, and it holds no attributes.
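
To see the three parts arrive one by one, here is a minimal sketch of a parser that only records which handler fired. EventLogger and its events list are names made up for this illustration, and the import fallback is there only so the snippet also runs on Python 3.

```python
try:
    from HTMLParser import HTMLParser   # Python 2.7, as used in this post
except ImportError:
    from html.parser import HTMLParser  # the same class on Python 3


class EventLogger(HTMLParser):
    """Hypothetical helper: records each parser callback as it fires."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # called for <a href="#">; attrs carries the attributes
        self.events.append(('start', tag, attrs))

    def handle_data(self, data):
        # called for the text between the start tag and the end tag
        self.events.append(('data', data))

    def handle_endtag(self, tag):
        # called for </a>
        self.events.append(('end', tag))


logger = EventLogger()
logger.feed('<a href="#">I am a link</a>')
logger.close()
for event in logger.events:
    print(event)
```

The start tag, the data and the end tag show up as three separate events, in that order.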

HTMLParser Methods You Have to Override

handle_starttag(self, tag, attrs)

This is the method you will override in most cases. It is called for every start tag and is used for extracting attributes and their values; attrs arrives as a list of (name, value) pairs.
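
As a rough illustration, the sketch below collects every href value it meets and skips duplicates. HrefCollector and the sample markup are invented for this example; only handle_starttag and feed come from HTMLParser.

```python
try:
    from HTMLParser import HTMLParser   # Python 2.7
except ImportError:
    from html.parser import HTMLParser  # Python 3


class HrefCollector(HTMLParser):
    """Hypothetical example: gathers each distinct href, in document order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [('href', '/about/')]
        for name, value in attrs:
            if name == 'href' and value not in self.hrefs:
                self.hrefs.append(value)


collector = HrefCollector()
collector.feed('<p><a href="/about/">About</a> '
               '<a class="ext" href="http://example.com/">Example</a> '
               '<a href="/about/">About again</a></p>')
collector.close()
print(collector.hrefs)
```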

handle_endtag(self, tag)

As the name states, this method handles end tags. It can be used to validate the HTML, for example by checking that every opened tag is closed again.
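
One possible way to do that kind of validation is to keep a stack of open tags and compare each end tag against the top of it. This is only a sketch: BalanceChecker, VOID_TAGS and is_balanced are hypothetical names, and the list of tags without an end tag is deliberately short.

```python
try:
    from HTMLParser import HTMLParser   # Python 2.7
except ImportError:
    from html.parser import HTMLParser  # Python 3


class BalanceChecker(HTMLParser):
    """Hypothetical validator: notes end tags that do not match the open tag."""

    # a few elements that never have an end tag (not a complete list)
    VOID_TAGS = ('br', 'hr', 'img', 'input', 'link', 'meta')

    def __init__(self):
        HTMLParser.__init__(self)
        self.open_tags = []  # stack of tags still waiting for their end tag
        self.errors = []     # end tags that did not match the top of the stack

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS:
            return  # <br/> style tags report an end tag too; ignore it
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append(tag)


def is_balanced(html):
    """True when every opened tag was closed in the right order."""
    checker = BalanceChecker()
    checker.feed(html)
    checker.close()
    return not checker.errors and not checker.open_tags


print(is_balanced('<div><p>Hi<br></p></div>'))
print(is_balanced('<div><p>Hi</div>'))
```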

handle_data(self, data)

This is the method you use to extract the text held by any tag: headings, paragraphs and so on. For example, to extract 'I am a link' from the previous example, this is the method to use:
def handle_data(self, data):
    print data
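
On its own, handle_data receives the text of every tag. To keep only the text inside links, one option is to combine it with the start and end tag handlers and a flag. The sketch below assumes that approach; LinkTextCollector and inside_link are made-up names.

```python
try:
    from HTMLParser import HTMLParser   # Python 2.7
except ImportError:
    from html.parser import HTMLParser  # Python 3


class LinkTextCollector(HTMLParser):
    """Hypothetical example: keeps only the text found inside <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.inside_link = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.inside_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.inside_link = False

    def handle_data(self, data):
        # fires for all text; the flag filters out everything outside links
        if self.inside_link:
            self.texts.append(data)


link_texts = LinkTextCollector()
link_texts.feed('<p>Intro text <a href="#">I am a link</a> and more text</p>')
link_texts.close()
print(link_texts.texts)
```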

__init__ method

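The Python documentation doesn't ask for it, but overriding __init__ is a convenient place to set up whatever state your parser needs. If you do override it, call HTMLParser.__init__(self) first, because the base initializer prepares the parser's internal state. The first parser I wrote didn't work, and adding this method solved the matter.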
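
Here is a small sketch of an overridden __init__, together with what happens when the base initializer is left out. TagCounter and BrokenCounter are hypothetical classes written for this illustration; the second one skips HTMLParser.__init__ on purpose.

```python
try:
    from HTMLParser import HTMLParser   # Python 2.7
except ImportError:
    from html.parser import HTMLParser  # Python 3


class TagCounter(HTMLParser):
    """Hypothetical example: counts how often each start tag appears."""

    def __init__(self):
        # let the base class set up its internal buffers first
        HTMLParser.__init__(self)
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


class BrokenCounter(TagCounter):
    def __init__(self):
        # deliberately skips HTMLParser.__init__ to show the failure
        self.counts = {}


counter = TagCounter()
counter.feed('<ul><li>one</li><li>two</li></ul>')
counter.close()
print(counter.counts)

try:
    BrokenCounter().feed('<p>text</p>')
except AttributeError as exc:
    print('base __init__ was skipped: %s' % exc)
```

The second parser fails inside feed() because the buffer that the base initializer creates is missing.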

Example

In my case, I needed to extract all the hrefs in a given piece of HTML and validate the links. Some were relative and some absolute, and my task was to check that they all worked.
This is the parser I coded into existence:
from HTMLParser import HTMLParser
import requests
from django.core.urlresolvers import resolve
from django.http import Http404

class MyHTMLParser(HTMLParser):

    def __init__(self, fp):
        """
        fp is an input stream returned by open() or urllib2.urlopen()
        """
        HTMLParser.__init__(self)
        self.seen = {}  # holds parsed hrefs
        self.is_good = True
        self.feed(fp.read())

    def handle_starttag(self, tag, attrs):
        """
        Looking for href attributes and validating them
        """
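        if not self.is_good:
            return  # a broken link was already found, nothing more to check
        for k, v in attrs:
            if k == 'href' and v not in self.seen:
                self.seen[v] = True
                try:
                    resolve(v)  # relative link: match it against the URLconf
                except Http404:
                    # not a local URL, so check it over HTTP
                    if not self._check_abs_url(v):
                        self.is_good = False
                        return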

    def status(self):
        """
        Indicator if all links in current html are working.
        Returns True if no broken links found.
        """
        return self.is_good

    def _check_abs_url(self, url):
        """
        Checks if the link is broken
        """
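        try:
            # HEAD fetches only the headers; 4xx/5xx responses do not raise,
            # so the status code has to be checked as well
            response = requests.head(url)
            return response.status_code < 400
        except requests.exceptions.RequestException:
            return False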

And that is my parser. The only methods I override are handle_starttag and __init__. I use a built-in Django function, resolve(), to validate relative links, and requests for absolute links. One other thing: this parser makes a lot of requests, so to go easier on both servers (the one that sends the request and the one that responds) I use HEAD requests.
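
The parser above decides that a link is absolute only after resolve() has failed. A possible variation, sketched here with the standard library alone, is to sort the links up front. is_absolute and split_links are hypothetical helpers, not part of the parser above, and anything that lacks both a scheme and a host (fragments, mailto: links, protocol-relative URLs) simply lands in the relative list and would need its own handling.

```python
try:
    from urlparse import urlparse      # Python 2.7
except ImportError:
    from urllib.parse import urlparse  # Python 3


def is_absolute(url):
    """Hypothetical helper: True when the URL names both a scheme and a host."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)


def split_links(hrefs):
    """Separate hrefs into (relative, absolute) lists, preserving order."""
    relative, absolute = [], []
    for href in hrefs:
        if is_absolute(href):
            absolute.append(href)
        else:
            relative.append(href)
    return relative, absolute


relative, absolute = split_links([
    '/about/',
    'http://example.com/x',
    'contact.html',
    'https://example.org',
])
print(relative)
print(absolute)
```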
