|You are here: Home > Dive Into Python > HTML Processing > Introducing sgmllib.py||<< >>|
Python from novice to pro
HTML processing is broken into three steps: breaking down the HTML into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into HTML again. The first step is done by sgmllib.py, a part of the standard Python library.
The key to understanding this chapter is to realize that HTML is not just text, it is structured text. The structure is derived from the more-or-less-hierarchical sequence of start tags and end tags. Usually you don't work with HTML this way; you work with it textually in a text editor, or visually in a web browser or web authoring tool. sgmllib.py presents HTML structurally.
sgmllib.py contains one important class: SGMLParser. SGMLParser parses HTML into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece, it calls a method on itself based on what it found. In order to use the parser, you subclass the SGMLParser class and override these methods. This is what I meant when I said that it presents HTML structurally: the structure of the HTML determines the sequence of method calls and the arguments passed to each method.
SGMLParser parses HTML into 8 kinds of data, and calls a separate method for each of them:
|Python 2.0 had a bug where SGMLParser would not recognize declarations at all (handle_decl would never be called), which meant that DOCTYPEs were silently ignored. This is fixed in Python 2.1.|
sgmllib.py comes with a test suite to illustrate this. You can run sgmllib.py, passing the name of an HTML file on the command line, and it will print out the tags and other elements as it parses them. It does this by subclassing the SGMLParser class and defining unknown_starttag, unknown_endtag, handle_data and other methods which simply print their arguments.
|In the ActivePython IDE on Windows, you can specify command line arguments in the “Run script” dialog. Separate multiple arguments with spaces.|
Here is a snippet from the table of contents of the HTML version of this book. Of course your paths may vary. (If you haven't downloaded the HTML version of the book, you can do so at http://diveintopython.org/.
c:\python23\lib> type "c:\downloads\diveintopython\html\toc\index.html" <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> <title>Dive Into Python</title> <link rel="stylesheet" href="diveintopython.css" type="text/css"> ... rest of file omitted for brevity ...
Running this through the test suite of sgmllib.py yields this output:
c:\python23\lib> python sgmllib.py "c:\downloads\diveintopython\html\toc\index.html" data: '\n\n' start tag: <html lang="en" > data: '\n ' start tag: <head> data: '\n ' start tag: <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" > data: '\n \n ' start tag: <title> data: 'Dive Into Python' end tag: </title> data: '\n ' start tag: <link rel="stylesheet" href="diveintopython.css" type="text/css" > data: '\n ' ... rest of output omitted for brevity ...
Here's the roadmap for the rest of the chapter:
Along the way, you'll also learn about locals, globals, and dictionary-based string formatting.
<< HTML Processing
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Extracting data from HTML documents >>