Monday, April 21, 2014
Using Python's HTMLParser class

A quick intro on parsing web pages in Python



Note: To get the most out of this tutorial, make sure you understand object-oriented programming in Python, specifically the concept of inheritance.

Python is a very powerful and simple language, which includes many modules to perform complex tasks simply, while still remaining flexible. One such tool is the HTMLParser class, which, as you might have guessed from the name, parses HTML and allows you to easily sort through information in tags. This was also the basis of my Google interface, which can be found here:
http://www.hellboundhackers.org/code/readcode.php?id=1145

To save writing more code, I will refer to my banked code as an example. I made mine object oriented, with HTMLParser as the parent class. The first step in __init__ is always to initialize the HTMLParser variables, which you can do with HTMLParser.__init__(self).
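
A minimal sketch of that first step, assuming Python 3, where the class is imported from html.parser rather than from the HTMLParser module. The names LinkCounter and count are hypothetical and only there for illustration.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class LinkCounter(HTMLParser):  # hypothetical subclass name
    def __init__(self):
        # Initialise the parent class first so its internal state exists
        HTMLParser.__init__(self)
        # Then set up whatever state your own parser needs
        self.count = 0


parser = LinkCounter()
# No handler is overridden yet, so feeding HTML changes nothing
parser.feed("<p>no handlers overridden yet</p>")
print(parser.count)
```

Calling the parent initializer before anything else means the parser is ready to accept HTML by the time your own setup code runs.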

HTMLParser has several built-in methods, but unlike the methods of most classes, these are intended to be overridden by you to suit your purpose. The most used of these is handle_starttag. This method has a fixed signature: besides self, it takes 2 arguments from HTMLParser, tag and attrs. tag holds the name of the tag (e.g. 'a', 'p', 'body'), and attrs holds a list of (name, value) pairs, for example [('href', 'www.hellboundhackers.org')].
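
To see what tag and attrs look like in practice, here is a small sketch that records every call to handle_starttag. It assumes Python 3 (html.parser); TagLogger and seen are hypothetical names, and the HTML string is made up for the demonstration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):  # hypothetical name
    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag;
        # attrs is a list of (name, value) tuples, empty if the tag has none
        self.seen.append((tag, attrs))


logger = TagLogger()
logger.feed('<body><p class="intro">Hi</p>'
            '<a href="http://www.hellboundhackers.org">HBH</a></body>')
for tag, attrs in logger.seen:
    print(tag, attrs)
```

Tags without attributes arrive with an empty list, which is why a check such as "attrs" being non-empty is enough to skip them.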

You must name the method you create handle_starttag, or it won't work. In the example below, I just wanted all non-empty 'a' tags, then a specific link from that list. Your program may be different, but the principle is the same: you are sorting through the tags.
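
Ahead of the full example, the sorting step on its own might look like the following offline sketch. It assumes Python 3 (html.parser); LinkFinder, links, the sample HTML and the 'readcode' filter are all hypothetical choices for illustration, not taken from the banked code.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):  # hypothetical name
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Ignore everything except 'a' tags that carry at least one attribute
        if tag == 'a' and attrs:
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)


finder = LinkFinder()
finder.feed('<a name="top"></a>'
            '<a href="/forum">Forum</a>'
            '<a href="/code/readcode.php?id=1145">Code</a>')
# Second pass: pick the specific link we care about out of the collected list
wanted = [link for link in finder.links if 'readcode' in link]
print(wanted)
```

Looking the attribute up by name, instead of by position, keeps the handler working when href is not the first attribute of the tag.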

Once you have written your tag handler, you can feed it some HTML. This is done with a single call to feed, a method inherited from HTMLParser, so you do not need to write it yourself. Once you have set up a request and opened it with urlopen, you will have some HTML in a variable. You just run self.feed(<variable>), and the HTML is passed to the parser, which calls your handler for each opening tag it finds. Here is the most basic parser possible, so you can see the flow of execution.

Code

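# Python 3 module names; in Python 2 these were HTMLParser and urllib2.
from html.parser import HTMLParser
from urllib.request import urlopen

class Spider(HTMLParser):
    def __init__(self, url):
        HTMLParser.__init__(self)
        req = urlopen(url)
        # feed() expects text, so decode the downloaded bytes first
        self.feed(req.read().decode('utf-8', 'replace'))

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; look up href by name
        if tag == 'a' and attrs:
            for name, value in attrs:
                if name == 'href':
                    print("Found link => %s" % value)

Spider('http://www.hellboundhackers.org')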
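
The same flow of execution can be tried without a network connection by feeding a string directly. The sketch below also shows that feed() may be called more than once: an incomplete tag is buffered until more data arrives or close() is called. It assumes Python 3 (html.parser); TagNames and names are hypothetical names.

```python
from html.parser import HTMLParser


class TagNames(HTMLParser):  # hypothetical name
    def __init__(self):
        HTMLParser.__init__(self)
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


parser = TagNames()
# The HTML arrives in two pieces, split in the middle of the <a> tag
parser.feed('<p>Visit <a hre')
after_first_chunk = list(parser.names)
parser.feed('f="http://www.hellboundhackers.org">HBH</a></p>')
parser.close()  # tell the parser there is no more input
print(after_first_chunk, parser.names)
```

This is useful when a page is read in blocks instead of with a single read() call, since the handler only fires once a tag is complete.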





If this article helped you, please rate + comment

Comments

stdio on July 25 2008 - 09:39:04
This definitely helped me understand the HTMLParser for Python
root_op on July 25 2008 - 11:25:36
Simple article, yet very informative. I've been thinking of learning Python for a while now, and this got me more interested than before. Thanks for a -Very Good- article, JJ
clone4 on July 25 2008 - 11:37:24
Simple, nice. I wish I didn't have to learn PHP now, otherwise I would have definitely started with Python; it really is a great language --Very Good--
ynori7 on July 25 2008 - 22:40:39
Nice article, very good rating from me
Neqtan on July 30 2008 - 01:42:04
Nice article JJ
Zephyr_Pure on August 29 2008 - 05:23:13
Damn it... I was just getting into it when it ended abruptly! Great content. In your next Py article, go into more depth and show more variety in examples. Can't wait.
The_Gman on October 03 2008 - 01:34:50
Your example ['href','www.hellboundhackers.org'] should have http:// in it, and your code is missing tabs. That said, it's very, very simple, but it mostly does the job. You should mention that handle_starttag is called for EVERY tag fed into it with HTMLParser.feed
Cyph3rHell on October 10 2008 - 15:20:55
Very nice article!!