Thursday, August 28, 2008

Using Python's HTMLParser class





A quick intro on parsing web pages in Python

Note: To get the most out of this tutorial, make sure you understand object-oriented programming in Python, specifically the concept of inheritance.

Python is a powerful yet simple language whose standard library includes many tools for performing complex tasks simply while remaining flexible. One such tool is the HTMLParser class, which, as you might have guessed from the name, parses HTML and lets you easily sort through the information in tags. It was also the basis of my Google interface, which can be found here:
http://www.hellboundhackers.org/code/readcode.php?id=1145

To save writing more code, I will refer to my banked code as an example. I made mine object-oriented, with HTMLParser as the parent class. The first step in __init__ is always to initialize HTMLParser's own variables, which you can do with HTMLParser.__init__(self).
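A minimal sketch of that first step, not taken from the banked code. The class name LinkCollector and its links attribute are made up for illustration. It is written for Python 3, where the class is imported from html.parser; on Python 2 the import is "from HTMLParser import HTMLParser" and the rest is unchanged.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class LinkCollector(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        # Initialise the parent class first so its internal state exists.
        HTMLParser.__init__(self)
        # Our own state can be set up afterwards.
        self.links = []


collector = LinkCollector()
# No handler methods are overridden yet, so feeding HTML changes nothing.
collector.feed("<p>nothing is collected yet</p>")
print(collector.links)
```

Because no handler has been overridden, the parser reads the HTML and silently discards it; the handlers are where the real work happens.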

HTMLParser has several built-in handler methods, but unlike the methods of most classes, these are intended to be overridden by you to suit your purpose. The most used of these is handle_starttag. This method has a fixed signature: besides self, it receives two arguments from HTMLParser, tag and attrs. tag holds the name of the tag (e.g. 'a', 'p', 'body'), and attrs holds a list of (name, value) pairs, one per attribute, for example [('href', 'www.hellboundhackers.org')].
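To see what actually arrives in tag and attrs, here is a small sketch that records every start tag it is given. The class name TagLogger, the seen list, and the sample HTML are assumptions for illustration; the import is the Python 3 one (html.parser).

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class TagLogger(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # tag is the tag name; attrs is a list of (name, value) tuples,
        # one tuple per attribute, in the order they appear in the tag.
        self.seen.append((tag, attrs))


logger = TagLogger()
logger.feed('<body><a href="www.hellboundhackers.org" class="nav">HBH</a><p>text</p></body>')
for tag, attrs in logger.seen:
    print(tag, attrs)
```

Tags without attributes arrive with an empty list, which is why a test such as "and attrs" is enough to skip them.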

You must name the method you create handle_starttag, or HTMLParser will never call it. In my example, I just wanted all non-empty 'a' tags, then a specific link from that list. Your program may be different, but the principle is the same: you are sorting through the tags.
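One way that kind of sorting could look, as a sketch under assumptions rather than the banked code itself: the class name LinkFinder, the sample HTML, and the 'forum' filter are all invented here. It looks the href up by name instead of by position, so it still works when href is not the first attribute. The import is the Python 3 one (html.parser).

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class LinkFinder(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Ignore everything except 'a' tags that carry attributes.
        if tag != 'a' or not attrs:
            return
        # Turn the (name, value) pairs into a dict and look up href by name.
        href = dict(attrs).get('href')
        if href:
            self.links.append(href)


finder = LinkFinder()
finder.feed('<a name="top"></a><a class="x" href="/code/">Code</a>'
            '<a href="/forum/">Forum</a><a></a>')

# Second step: pick the specific link we care about from the collected list.
wanted = [link for link in finder.links if 'forum' in link]
print(finder.links)
print(wanted)
```

Collecting first and filtering afterwards keeps the handler short; the filter could equally live inside handle_starttag if only one link is ever needed.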

Once you have written your tag handler, you can feed it some HTML. This takes a single call to the inherited feed method; you do not write a method of your own for it. Once you have set up a request and 'urlopen'ed it, you will have some HTML in a variable. You just run self.feed(<variable>), and the HTML is passed automatically to the handler. Here is the most basic parser possible, so you can see the flow of execution.


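from HTMLParser import HTMLParser
from urllib2 import urlopen

class Spider(HTMLParser):
    def __init__(self, url):
        # Initialize the parent class before using any of its methods
        HTMLParser.__init__(self)
        req = urlopen(url)
        # Pass the page source to the parser; handlers fire as tags are found
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, so attrs[0][1] is the value
        # of the first attribute (the URL only when href is listed first)
        if tag == 'a' and attrs:
            print "Found link => %s" % attrs[0][1]

Spider('http://www.hellboundhackers.org')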
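The same flow can be tried without a network connection. The sketch below is an assumption-laden variant of the Spider above: it stores links instead of printing them, and a hard-coded byte string stands in for the result of urlopen(url).read(). It targets Python 3, where the import is html.parser and feed expects text, so the bytes are decoded first. It also shows that feed can be called more than once, with a tag split across two calls.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class Spider(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Same test as the basic parser: any 'a' tag with attributes.
        if tag == 'a' and attrs:
            self.links.append(attrs[0][1])


# Stand-in for urlopen(url).read(); sample data, not a real page.
raw = b'<p>Intro</p><a href="/news/">News</a><a href="/code/">Code</a>'
html = raw.decode('utf-8')  # assumes the page is UTF-8 encoded

spider = Spider()
# The first chunk ends in the middle of the first 'a' tag. The parser
# buffers the incomplete tag until the rest arrives in the next call.
spider.feed(html[:20])
spider.feed(html[20:])
spider.close()  # tell the parser there is no more input
print(spider.links)
```

Feeding in chunks is handy when a page is read piece by piece instead of all at once.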






© HellBound Hackers 2007- 2008. Since 3rd December 2004.