Thursday, May 17, 2012
Using Python's HTMLParser class




A quick intro on parsing web pages in Python



Note: To get the most out of this tutorial, make sure you understand object-oriented programming in Python, specifically the concept of inheritance.

Python is a powerful yet simple language, and its standard library includes many tools that make complex tasks easy while remaining flexible. One such tool is the HTMLParser class, which, as you might have guessed from the name, parses HTML and lets you easily sort through the information in tags. It was also the basis of my Google interface, which can be found here:
http://www.hellboundhackers.org/code/readcode.php?id=1145

To save writing more code, I will refer to my banked code as an example. I made mine object-oriented, with HTMLParser as the parent class. The first step in __init__ is always to initialize the HTMLParser variables; you can do this with HTMLParser.__init__(self).
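As a minimal sketch of that first step (written for Python 3, where the class lives in html.parser rather than the Python 2 HTMLParser module used in this article), a subclass might start out like this. The class name LinkCollector and its links attribute are made up for illustration and are not taken from the banked code.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical subclass; the name and the `links` list are illustrative."""

    def __init__(self):
        # Set up the parser's internal state first, as the article describes.
        HTMLParser.__init__(self)
        # Our own state goes after the parent has been initialized.
        self.links = []


collector = LinkCollector()
# No handler methods are overridden yet, so feeding HTML collects nothing.
collector.feed("<p>nothing is collected from this</p>")
```

Skipping the parent __init__ call leaves the parser without its internal buffers, so feed would fail; that is why it comes first.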

HTMLParser has several built-in methods, but unlike those of most classes, they are intended to be overridden by you to suit your purpose. The most used of these is handle_starttag, which HTMLParser calls for every start tag it is fed. The method has a standard layout and takes two arguments from HTMLParser: tag and attrs. tag holds the name of the tag (e.g. 'a', 'p', 'body'), and attrs holds a list of (name, value) pairs, for example [('href', 'http://www.hellboundhackers.org')].
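To see what actually arrives in tag and attrs, here is a small sketch (Python 3 import; the TagLogger class and its seen list are hypothetical names used only for this illustration) that records every call to handle_starttag for a short HTML string.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical subclass that records every start tag it is handed."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Called once for every start tag, not only the ones we care about.
        # attrs is a list of (name, value) tuples, empty if the tag has none.
        self.seen.append((tag, attrs))


logger = TagLogger()
logger.feed('<body><p>Hi</p><a href="http://www.hellboundhackers.org">HBH</a></body>')
for tag, attrs in logger.seen:
    print(tag, attrs)
# body []
# p []
# a [('href', 'http://www.hellboundhackers.org')]
```

Because attrs is a list of pairs, dict(attrs) is a convenient way to look an attribute up by name instead of by position.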

You must name the method you create handle_starttag, or it won't work. In my example, I just wanted all non-empty 'a' tags, then a specific link from that list. Your program may be different, but the principle is the same: you are sorting through the tags.
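The two-stage idea (keep only the non-empty 'a' tags, then pick one link out of the result) could be sketched roughly as follows. This is not the banked code; AnchorFilter, SAMPLE, and the '/code/' test are assumptions chosen just to show the pattern, and the import is the Python 3 one.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class AnchorFilter(HTMLParser):
    """Hypothetical handler that keeps the href of each non-empty 'a' tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Skip everything that is not an 'a' tag with at least one attribute.
        if tag != 'a' or not attrs:
            return
        # Look the attribute up by name; its position in the tag can vary.
        href = dict(attrs).get('href')
        if href:
            self.hrefs.append(href)


# Made-up sample markup, including anchors that should be ignored.
SAMPLE = (
    '<a name="top"></a>'
    '<a>bare anchor</a>'
    '<a href="/articles/">Articles</a>'
    '<a class="ext" href="http://www.hellboundhackers.org/code/">Code bank</a>'
)

parser = AnchorFilter()
parser.feed(SAMPLE)

# Second stage: pick the specific link(s) wanted out of the collected list.
code_links = [h for h in parser.hrefs if '/code/' in h]
```

Keeping the handler dumb (collect everything that qualifies) and filtering afterwards makes it easy to change which link you are after without touching the parser.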

Once you have written your tag handler, you can feed it some HTML. This takes a single call to feed, a method your class inherits from HTMLParser. Once you have 'urlopen'ed a request and read the response, you will have some HTML in a variable. You just run self.feed(<variable>), and the HTML is passed automatically to the handler. Here is the most basic parser possible, so you can see the flow of execution.


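# Python 2 code. In Python 3 these modules are html.parser and urllib.request.
from HTMLParser import HTMLParser
from urllib2 import urlopen

class Spider(HTMLParser):
    def __init__(self, url):
        # Initialize the parent class before feeding it anything.
        HTMLParser.__init__(self)
        req = urlopen(url)
        # feed() parses the HTML and calls our handler for each start tag.
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; look up href by name
        # rather than assuming it is the first attribute.
        if tag == 'a' and attrs:
            href = dict(attrs).get('href')
            if href:
                print "Found link => %s" % href

Spider('http://www.hellboundhackers.org')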
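For readers on Python 3, a rough port of the same idea might look like the sketch below. The crawl method and the links list are additions that are not in the original code, and decoding the page as UTF-8 is an assumption. Fetching is kept separate from parsing so the handler can be tried on a plain string without network access.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class Spider(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Same filter as above: non-empty 'a' tags that carry an href.
        if tag == 'a' and attrs:
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

    def crawl(self, url):
        """Fetch `url` and feed its HTML to the parser (needs network access)."""
        with urlopen(url) as response:
            # read() returns bytes in Python 3, but feed() expects str.
            # UTF-8 is an assumption; real pages may use another encoding.
            html = response.read().decode('utf-8', errors='replace')
        self.feed(html)
        return self.links


# Offline demonstration: feed() accepts any HTML string, not just a download.
spider = Spider()
spider.feed('<a href="http://www.hellboundhackers.org">Home</a><a>no href</a>')
print(spider.links)  # ['http://www.hellboundhackers.org']
```

To crawl a live page instead, call Spider().crawl('http://www.hellboundhackers.org') and print the returned list.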


If this article helped you, please rate + comment

Comments

stdio on July 25 2008 - 09:39:04
This definitely helped me understand the HTMLParser for Python
root_op on July 25 2008 - 11:25:36
Simple article, yet very informative. I've been thinking of learning Python for a while now, and this got me more interested than before. Thanks for a -Very Good- article, JJ
clone4 on July 25 2008 - 11:37:24
Simple, nice ;) I wish I didn't have to learn PHP now; otherwise I would definitely have started with Python. It really is a great language. --Very Good-- ;)
ynori7 on July 25 2008 - 22:40:39
Nice article, very good rating from me
Neqtan on July 30 2008 - 01:42:04
Nice article JJ
Zephyr_Pure on August 29 2008 - 05:23:13
Damn it... I was just getting into it when it ended abruptly! Great content. In your next Py article, go into more depth and show more variety in examples. Can't wait. :)
The_Gman on October 03 2008 - 01:34:50
Your example ['href','www.hellboundhackers.org'] should have http:// in it, and your code is missing tabs. That said, it's very, very simple, but it mostly does the job. You should mention that handle_starttag is called for EVERY tag fed into it with HTMLParser.feed.
Cyph3rHell on October 10 2008 - 15:20:55
Very Nice article!! :D
Ratings
Awesome! 9% [1 Vote]
Very Good 73% [8 Votes]
Good 18% [2 Votes]
Average 0% [No Votes]
Poor 0% [No Votes]
© HellBound Hackers 2008- 2009. Since 3rd December 2004.