Monday, June 24, 2013

HTMLParser for small and easy tasks

Python 2.7, Django 1.4

When I first started learning web development, my first task ever was to scrape dozens of web sites. New language, new concepts, new tools. It took me days to complete the task, and I learned how NOT to build a web site. To complete that task I used a web scraping framework called Scrapy. Since then I have come to know lxml, Beautiful Soup and HTMLParser. For any extensive web harvesting I use Scrapy, but for small tasks HTMLParser is just the thing.

HTMLParser

So, what is HTMLParser and why use it?
HTMLParser is a module in the Python standard library, so if you have Python installed, you already have it. On its own, HTMLParser does nothing: if you feed it data without overriding anything, you get nothing in return. To make it tick, you override the handler methods you need, and the parser calls them as it reads the input. The only thing HTMLParser provides is the machinery for parsing X/HTML formatted text; that part is built in and you don't need to change it.
Before continuing, let's take a look at an HTML tag:
<a href="#">I am a link</a>

The first part, which comes before 'I am a link', is the start tag, and that is where all the attributes live.
'I am a link' is the data that this tag holds.
The last part, </a>, is called the end tag. Most HTML tags have one, and it holds no attributes.

HTMLParser Methods You Have to Override

handle_starttag(self, tag, attrs)

This is the method you want to override in most cases and is used for extracting attributes and their data.

handle_endtag(self, tag)

As the name states, handles end tags. Can be used to validate the html.
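To give an idea of what "validating" with handle_endtag could look like, here is a minimal sketch that only checks whether tags are closed in the order they were opened. The class name TagBalanceChecker, the helper is_balanced and the short VOID_TAGS list are my own assumptions for illustration, and this is nowhere near a real HTML validator.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7

# Assumption: a short, incomplete list of tags that never get an end tag.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: tracks open tags on a stack."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.open_tags = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # <br/> style tags open and close themselves, nothing to track.
        pass

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append('unexpected </%s>' % tag)


def is_balanced(html):
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return not checker.errors and not checker.open_tags
```

Calling is_balanced() on a page tells you whether every opened tag was closed in order; the errors list says which end tag was out of place.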

handle_data(self, data)

This is the method you can use to extract the text inside heading, paragraph and any other tags. For example, if you want to extract 'I am a link' from the previous example, this is the method to use:
def handle_data(self, data):
    print data
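To see how the three handlers fit together, here is a small sketch that records every callback the parser makes for the link tag shown earlier. The class name EventRecorder and its events list are made up for this illustration, and the import fallback is only there so the snippet runs on both Python 2.7 and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class EventRecorder(HTMLParser):
    """Hypothetical helper: remembers each handler call in order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(('start', tag, dict(attrs)))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


recorder = EventRecorder()
recorder.feed('<a href="#">I am a link</a>')
recorder.close()
```

After feeding the tag, recorder.events holds one entry for the start tag (with its href attribute), one for the text and one for the end tag, in that order.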

__init__ method

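The Python documentation doesn't state it, but I advise overriding this method and adapting it to your needs. If you do, call HTMLParser.__init__(self) inside it so the parser can set up its internal state. The first parser I wrote didn't work, and adding this method solved the matter.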
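One way an __init__ override can go wrong is by skipping the base class initializer, which is where the parser sets up its internal buffer. I can't say for sure that this is what bit my first parser, so treat the sketch below as an illustration of that one pitfall. BrokenParser, WorkingParser and can_feed are hypothetical names.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class BrokenParser(HTMLParser):
    def __init__(self):
        # The base __init__ is never called, so the parser's
        # internal state is never set up.
        self.links = []


class WorkingParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)  # sets up the internal state
        self.links = []


def can_feed(parser_cls):
    """Return True if an instance of parser_cls survives feed()."""
    try:
        parser_cls().feed('<p>hi</p>')
        return True
    except AttributeError:
        return False
```

Here can_feed(BrokenParser) fails while can_feed(WorkingParser) succeeds, which is why the base __init__ call comes first in the parser in the Example section.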

Example

In my case, I needed to extract every href in a given HTML page and validate the links. Some were relative and some absolute, and my task was to check that they all worked.
This is the parser I coded into existence:
from HTMLParser import HTMLParser
import requests
from django.core.urlresolvers import resolve
from django.http import Http404

class MyHTMLParser(HTMLParser):

    def __init__(self, fp):
        """
        fp is an input stream returned by open() or urllib2.urlopen()
        """
        HTMLParser.__init__(self)
        self.seen = {}  # holds parsed hrefs
        self.is_good = True
        self.feed(fp.read())

    def handle_starttag(self, tag, attrs):
        """
        Looking for href attributes and validating them
        """
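        for k, v in attrs:
            if k == 'href' and v not in self.seen:
                self.seen[v] = True
                try:
                    resolve(v)
                except Http404:
                    # Not a local URL, so check it over HTTP. Only ever
                    # switch is_good to False, so a later working link
                    # cannot hide an earlier broken one.
                    if not self._check_abs_url(v):
                        self.is_good = False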

    def status(self):
        """
        Indicator if all links in current html are working.
        Returns True if no broken links found.
        """
        return self.is_good

    def _check_abs_url(self, url):
        """
        Checks if the link is broken
        """
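        try:
            # An HTTP error status (4xx/5xx) does not raise an
            # exception, so check the status code explicitly.
            response = requests.head(url)
            return response.status_code < 400
        except requests.exceptions.RequestException:
            return False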

And that is my parser. The only methods I override are handle_starttag and __init__. I use Django's built-in resolve() function to validate relative links, and requests for absolute links. One other thing: this parser makes a lot of requests, so to make it easier on both servers (the one that sends the request and the one that responds) I use HEAD requests.
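If you want to try the idea without Django or a network connection, here is a stripped-down sketch that only collects the hrefs and sorts them into relative and absolute ones. It is not the parser above: HrefCollector and collect_hrefs are hypothetical names, and I am assuming "absolute" simply means the URL has a host part.

```python
try:
    from html.parser import HTMLParser  # Python 3
    from urllib.parse import urlparse
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7
    from urlparse import urlparse


class HrefCollector(HTMLParser):
    """Hypothetical helper: gathers unique hrefs, split by kind."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = set()
        self.relative = []
        self.absolute = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Skip other attributes, empty hrefs and duplicates.
            if name != 'href' or value is None or value in self.seen:
                continue
            self.seen.add(value)
            if urlparse(value).netloc:
                self.absolute.append(value)  # has a host part
            else:
                self.relative.append(value)


def collect_hrefs(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.relative, collector.absolute
```

From here the relative list could go to resolve() and the absolute list to requests.head(), as in my parser.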

Wednesday, June 19, 2013

Setting Up Google App Engine WebApp2 Project in Virtualenv

Python 2.7, virtualenv, Ubuntu 12.04, GAE 1.7.1
This is a short HOW-TO for how I solved my ImportError: no module google.appengine.ext while working in virtualenv.
GAE can be downloaded from here.
To install GAE on Linux, just extract the content wherever you want. I use an apps/ directory under my /home/usr directory, which means that in my case GAE will be found in /home/usr/apps/google_appengine.
After downloading and extracting GAE to said directory, I create a virtualenv for my project.
Project directory structure:
/ProjectA
    /bin
    /build
    /include
    /lib
    /local
    /man
    /src
        /app
            /static
                /img
                /js
                /css
            /templates
$ virtualenv ProjectA/ --no-site-packages

I'm running this command from the parent directory of ProjectA. In general you need to specify a full path to the directory that will be the virtual environment.
More info can be found here.
After creating the virtualenv, I create a src directory structure inside ProjectA that will hold all my code. As someone who comes from a Django background, I tend to keep the same architecture in webapp2 (for example, all my models are saved in models.py), as I find it a very good way of coding.
OK, so far we have downloaded Google App Engine and created the project directory structure.
Let's activate our virtualenv:
$ cd ProjectA/
$ source bin/activate
(ProjectA)$

So now we have a clean development environment with its own python and pip ready and working.
As someone who practices the TDD way of coding, my first pip command is:
(ProjectA)$ pip install nose
This will install the latest version of nose, a testing framework for Python. And here is the catch: nose will search your whole project directory for tests, but only in package directories or test directories (more info here), so either create a directory that matches nose's testMatch or make your app a package. When you use Django, this is done automatically for you, but in GAE your app doesn't have to be a package to run. Adding __init__.py to your app directory will solve it.
To link the GAE we downloaded before to this virtualenv, add a gae.pth file to the virtualenv's lib/python2.7/site-packages directory with the following content:
<full path to GAE directory>
<full path to GAE directory>/google
<full path to GAE directory>/lib/antlr3
<full path to GAE directory>/webapp2
<full path to GAE directory>/lib/yaml/lib

That's it. Now you can use statements like 'from google.appengine.ext import ndb'.
Run a few tests on your code. If you see any more GAE-related ImportErrors, just add the path of the needed module to gae.pth.
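If you set up more than one virtualenv, typing those paths by hand gets old. Below is a rough sketch of a script that builds the same five lines from a GAE root directory. The function names gae_pth_lines and write_gae_pth are made up, and the sub-directory list is just the one from this post; adjust it to whatever your SDK version ships.

```python
import os
import posixpath

# Sub-directories listed in this post, relative to the GAE root.
# The empty string stands for the GAE root itself.
GAE_SUBDIRS = ['', 'google', 'lib/antlr3', 'webapp2', 'lib/yaml/lib']


def gae_pth_lines(gae_root):
    """Return the lines that go into gae.pth for the given GAE root."""
    return [posixpath.join(gae_root, sub) if sub else gae_root
            for sub in GAE_SUBDIRS]


def write_gae_pth(site_packages_dir, gae_root):
    """Write gae.pth into site_packages_dir and return its path."""
    pth_path = os.path.join(site_packages_dir, 'gae.pth')
    with open(pth_path, 'w') as pth_file:
        pth_file.write('\n'.join(gae_pth_lines(gae_root)) + '\n')
    return pth_path
```

With my layout, that would be write_gae_pth('ProjectA/lib/python2.7/site-packages', '/home/usr/apps/google_appengine').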

Google App Engine and zc.buildout

Setting up a project is like laying the foundation for the house you are going to build. It is just that important. The right project structure can save you from unneeded headaches (have no worries, you will have enough of those anyway :) ).
There is an excellent guide on how to set up a Django project with buildout here.
But I didn't find any such guide on how to set up a GAE project, and after struggling with it for a few hours, this is how I made it work.

First things first - directory structure

project/
   bootstrap.py  †
   buildout.cfg  †
   .installed.cfg
   README
   setup.py
   parts/
   develop-eggs/
   bin/
       buildout
   eggs/
   downloads/
   src/

† put these items under version control
For more info about each part, read here.
So, first things first, right? I start by creating the project/ directory, with a src/ directory under it. src/ is where all of the code resides. I will have another post about how to set up a GAE application; the applications reside in the src/ directory.
$ cd path/to/project/
$ wget http://svn.zope.org/*checkout*/zc.buildout/trunk/bootstrap/bootstrap.py
$ touch setup.py
$ touch buildout.cfg
$ cat  -> buildout.cfg
[buildout]
parts = 
^D
After this we will have a bootstrap.py, buildout.cfg and setup.py alongside src/ directory.

And here we go..

Next thing, let's get us a buildout directory structure:
$ python bootstrap.py
This will create all the other directories you can see in the project/ structure (aside from downloads/, which is created later).
In the next step we need a buildout.cfg and a proper setup.py, so let's create those now.
A word about zc.buildout. This tool allows you to create an isolated development environment, much like virtualenv. In it, you can install whatever Python version you need, and it will be installed for this project only. The same goes for other packages.
Open buildout.cfg in an editor of your choice and add the following:
[buildout]
parts = mygaeapp
develop = .

[mygaeapp]
recipe = rod.recipe.appengine
url = http://googleappengine.googlecode.com/files/google_appengine_1.7.2.zip
server-script = dev_appserver
src = ${buildout:directory}/src/mygaeapp
exclude = tests
In my case I use the system Python. If you want to use a local version of Python, you can add it using various recipes on PyPI.
Next let's open setup.py and modify it:
from setuptools import setup, find_packages

install_requires = [
    'setuptools',
    'nose',
]
setup(
    name='project',
    version='1.0',
    packages=find_packages('src'),
    package_dir = {'': 'src'},
    install_requires = install_requires,
    url='github link for example',
    license='LICENSE',
    author='Your Name',
    author_email='your@email.com',
    description='Describe your project here or link a readme file'
)
Next thing, run:
$ ./bin/buildout
This will download Google App Engine from the link you specified in buildout.cfg and install your packages from src/ into the parts/ folder, along with any other dependencies you specified in setup.py and buildout.cfg. It will also add dev_appserver and appcfg.py to bin/, which you need to run the server and upload your application to Google App Engine.

Attention!

I found out that it's better not to use a global zc.buildout. It caused me a lot of silly errors, and hours of headaches.

Summary

zc.buildout is a nice tool for distributing your code to collaborators. It lets you provide a simple setup and ensures a quick and easy entrance for other developers into your project.

How To Organize a GAE Project

For better or worse, Google App Engine allows us to use whatever project tree structure we want. As long as we connect all URLs with the appropriate handlers in app.yaml, it will all work just fine.
My first GAE app had models.py for all my models (Django school, can't help it :) ) and a separate {view_name}.py for each URL. I know, it's ugly, but following the GAE tutorial led me to this structure and it all worked OK. The first sign of having the wrong kind of project structure came when I wrote a custom module to enhance ndb and couldn't make dev_appserver recognize it.
After long hours of digging through the Internet and a fair amount of frustration, I finally turned to my mentor for an answer, and he pointed me to a StackOverflow question about the same problem.
And here is my final project structure:
project/
   src/
      app1/
         lib/
            __init__.py
            mycustommodule.py
         static/
            img/
            css/
            js/
         templates/
         tests/
            __init__.py
            mytests.py
         __init__.py
         app.yaml
         cron.yaml
         lib_path.py
         index.yaml
         models.py
         urls.py
         views.py
Most of what you see is self-explanatory, I think.
index.yaml is produced by GAE when you upload your application to the cloud; do not touch it.
lib_path.py is a simple file adding lib/ to sys.path in such a way that dev_appserver recognizes it:
import os
import sys

sys.path.append(os.path.join(os.path.dirname(__file__), 'lib'))
Before this, I had linked my custom module in site-packages, which allowed me to import my module in the console, test it and use it just fine, but when I ran dev_appserver it crashed on an import error. I went as far as creating a setup.py, but that didn't help much, and it also didn't feel right.
Now, to use my custom module, I can import it like this anywhere in my application:
from lib import mycustommodule
and woot! it works. How nice and simple indeed.
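If you want to convince yourself of what the sys.path line in lib_path.py does outside of dev_appserver, here is a small self-contained sketch. It builds a throwaway app directory with a lib/ folder, adds that folder to sys.path the same way, and imports a module from it directly. The names add_lib_to_path and demo_custom_module are made up for the demo, and the duplicate check is my addition, not part of lib_path.py.

```python
import importlib
import os
import sys
import tempfile


def add_lib_to_path(app_dir):
    """Append app_dir/lib to sys.path once and return that path."""
    lib_dir = os.path.join(app_dir, 'lib')
    if lib_dir not in sys.path:
        sys.path.append(lib_dir)
    return lib_dir


# Build a throwaway "app" directory containing lib/demo_custom_module.py
app_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(app_dir, 'lib'))
with open(os.path.join(app_dir, 'lib', 'demo_custom_module.py'), 'w') as f:
    f.write('GREETING = "hello from lib"\n')

lib_dir = add_lib_to_path(app_dir)
demo = importlib.import_module('demo_custom_module')
```

Once lib/ is on sys.path, anything dropped into it can be imported by its plain name, which is handy for third-party packages you bundle with the app.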
Let's take a look at urls.py:
import lib_path

import wsgiref.handlers
import webapp2
import views

url_map = [
    ('/', views.MainPage),
]

app = webapp2.WSGIApplication(url_map, debug=True)
This is where I import lib_path, and it has to be imported first. I don't import it anywhere else.
My app.yaml:
application: helloworld
version: 1
runtime: python27
api_version: 1
threadsafe: yes

handlers:
- url: /static
  static_dir: static

- url: /.*
  script: urls.app

libraries:
- name: webapp2
  version: "2.5.1"
- name: jinja2
  version: "2.6"
It is better to use a real version number than 'latest'. This way, if one of the functions you are using is missing from the newest version, your code won't break and your application keeps running.
And let's not forget views.py :
import os

import jinja2
import webapp2

from models import *

jinja_environment = jinja2.Environment(
    loader=jinja2.FileSystemLoader(os.path.dirname(__file__)))

class MainPage(webapp2.RequestHandler):

    def get(self):

        msg = 'Hello World!'

        template_values = {
            'data': msg,
        }
        template = jinja_environment.get_template('/templates/home.html')
        self.response.out.write(template.render(template_values))
Simple, right? ;)
One other thing: I suggest writing all the queries (with filters) as @classmethod methods in your models.py, for the following reasons:
  1. Simple to use - anyone else who works with you on this project doesn't have to know how to build the query, so there is one less place for errors.
  2. Easy to change - if Google changes how BigTable does this query or that, you will have to update your code in one place only.
  3. Easy to maintain.
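To show the shape of this pattern without needing the GAE SDK installed, here is a plain-Python stand-in. Everything in it is hypothetical: Article is not an ndb model, and the _store list only pretends to be the datastore. In a real models.py the body of the classmethod would hold the actual ndb query and its filters.

```python
class Article(object):
    """Plain-Python stand-in for a model class (not real ndb)."""

    _store = []  # pretends to be the datastore

    def __init__(self, title, published):
        self.title = title
        self.published = published

    def put(self):
        Article._store.append(self)
        return self

    @classmethod
    def published_titles(cls):
        # The one place that knows how the "query" is built.
        # Callers never touch the filter logic themselves.
        return sorted(a.title for a in cls._store if a.published)


Article('Draft', False).put()
Article('Hello', True).put()
Article('Buildout', True).put()

titles = Article.published_titles()
```

A view then only calls Article.published_titles(), and if the query ever has to change, it changes in models.py and nowhere else.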
That's it. I hope it helps you get started with developing your application instead of spending precious time on organizing imports.

Table w. Checkboxes


Django 1.4, Python 2.7


This is one of those things you know everyone does, but there is no clear post explaining how. Or at least I couldn't find one.

MY GOAL:


Show a table of objects where every row has a checkbox. The user checks some rows and clicks submit (at the end of the table), and I receive a list of the chosen items to work with.

MY SOLUTION:

In template:
<div>
    <form method="post" action="">
        {% csrf_token %}
        <table>
            <caption><h2>Table w. Checkboxes</h2></caption>
            <thead>
                <tr>
                    <th>Name</th>
                    <th>Activate</th>
                </tr>
            </thead>
            <tbody>
                {% for o in objects %}
                    <tr>
                        <td>{{ o.name }}</td>
                        <td>
                            <input type="checkbox" name="{{ o.id }}">
                        </td>
                    </tr>
                {% endfor %}
            </tbody>
        </table>
    <input type="submit" value="Go!">
    </form>
</div>

In views.py:
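from django.core.urlresolvers import reverse
from django.http import HttpResponseRedirect
from django.shortcuts import render


def table_w_checkboxes(request):
    if request.method == 'POST':
        for o in request.POST.iterkeys():
            if o == 'csrfmiddlewaretoken':
                continue  # added by {% csrf_token %}, not a checkbox
            print o  # this is the name value of the checkbox
            # do stuff with whatever value you put in the name attr of the
            # input tag
        # redirect only after every checked item has been handled
        return HttpResponseRedirect(reverse('to-where-ever'))

    objects = Model.objects.all()
    return render(request, 'table_w_checkboxes.html', {'objects': objects})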

First of all, no need for a special form. I did not use any form.
Secondly, see that for loop? That's the part that matters.
request.POST.iterkeys() yields the 'name' attrs of all checked checkboxes (along with the csrfmiddlewaretoken field added by {% csrf_token %}). Meaning, items that were not checked will not be in this list.
In my case, the name attr holds the object's id. So, o is the id of an object that was checked. Now I can do whatever I need with it. I can do Model.objects.get(id=o), or add the object to another object's relation field - MyrelatedModel.m2mfield.add(o), for example.
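Since the POST data also carries the CSRF token, it can help to pull the ids out in one small step before touching the database. Here is a sketch that works on any dict-like object, so you can try it without Django. The function name checked_ids is made up, and it assumes, as in my template, that every checkbox name is a numeric object id.

```python
def checked_ids(post_data):
    """Return the numeric keys of post_data as a sorted list of ints.

    Non-numeric keys such as 'csrfmiddlewaretoken' are skipped.
    """
    ids = []
    for key in post_data:
        if key.isdigit():
            ids.append(int(key))
    return sorted(ids)


# A plain dict standing in for request.POST
fake_post = {'csrfmiddlewaretoken': 'abc123', '3': 'on', '12': 'on'}
selected = checked_ids(fake_post)
```

In the view, the result could go straight into a single query such as Model.objects.filter(id__in=checked_ids(request.POST)) instead of one get() per checkbox.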
That's it! Now you can show users a table of objects and let them select what they need. If you have any more questions, feel free to comment ;)