htmlparse(3tcl) HTML Parser htmlparse(3tcl)
______________________________________________________________________________
NAME
htmlparse - Procedures to parse HTML strings
SYNOPSIS
package require Tcl 8.2
package require struct::stack 1.3
package require cmdline 1.1
package require htmlparse ?1.2.2?
::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var?
?-queue q? html
::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag
::htmlparse::mapEscapes html
::htmlparse::2tree html tree
::htmlparse::removeVisualFluff tree
::htmlparse::removeFormDefs tree
______________________________________________________________________________
DESCRIPTION
The htmlparse package provides commands that allow libraries and appli-
cations to parse HTML in a string into a representation of their
choice.
The following commands are available:
::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var?
?-queue q? html
This command is the basic parser for HTML. It takes an HTML
string, parses it, and invokes a command prefix for every tag it
encounters. The HTML does not have to be valid for the parser to
function; checking validity is the responsibility of the command
invoked for every tag. That command is also responsible for
handling tag attributes and character entities (escaped
characters). The parser passes the uninterpreted tag attributes
to the invoked command to help with the former, and the package
provides a helper command, ::htmlparse::mapEscapes, to help with
the latter. The parser ignores leading DOCTYPE declarations and
all valid HTML comments it encounters.
All information beyond the HTML string itself is specified via
options, which are explained below.
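As a minimal sketch of normal-mode parsing, the following registers a callback through -cmd and records the tags it is handed. The procedure name ::collectTag and the variable ::seen are illustrative assumptions and not part of the package.

```tcl
package require htmlparse

# Hypothetical callback: the parser appends tag, slash, param and the
# text following the tag to the command prefix given via -cmd.
proc ::collectTag {tag slash param text} {
    lappend ::seen "$slash$tag"
}

set ::seen {}
::htmlparse::parse -cmd ::collectTag {<p>Hello <b>world</b></p>}

# ::seen now lists opening and closing tags in document order,
# enclosed by the default virtual root tag (hmstart).
puts $::seen
```

A fully qualified callback name is used so that the command resolves regardless of the namespace in which the parser evaluates it.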
Some background on the parser helps in understanding the
options.
The parser can detect incomplete tags in the HTML string
given to it. Under normal circumstances this will cause the
parser to throw an error, but if the option -incvar is used to
specify a global (or namespace) variable, the parser will store
the incomplete part of the input into this variable instead.
This will aid greatly in the handling of incrementally arriving
HTML, as the parser will handle whatever it can and defer the
handling of the incomplete part until more data has arrived.
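The handling of incrementally arriving HTML described above could be sketched as follows. The two chunks, the callback ::noteTag, and the global variable ::pending are assumptions made for illustration; the sketch also assumes that the parser prepends the stored tail to the input of the next call.

```tcl
package require htmlparse

proc ::noteTag {tag slash param text} {
    lappend ::tags "$slash$tag"
}

set ::tags {}
set ::pending {}   ;# global variable receiving the incomplete tail

# The first chunk stops in the middle of the <b> tag.
::htmlparse::parse -cmd ::noteTag -incvar ::pending {<p>Hello <b}
set ::afterFirst $::pending

# The next chunk completes the tag.
::htmlparse::parse -cmd ::noteTag -incvar ::pending {>world</b></p>}

puts "stored after first chunk: $::afterFirst"
puts "tags seen: $::tags"
```

Without -incvar the first call would be expected to raise an error, since the parser has nowhere to put the incomplete tag.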
Another feature of the parser is its two modes of operation.
The normal mode is active if the option -queue is not present on
the command line invoking the parser. If it is present, the
parser goes into incremental mode instead.
The main difference is that a parser in normal mode immediately
invokes the command prefix for each tag it encounters. In
incremental mode, however, the parser generates a number of
scripts which invoke the command prefix for groups of tags in
the HTML string and then stores these scripts in the specified
queue. It is then the responsibility of the caller of the parser
to ensure the execution of the scripts in the queue.
Note: The queue object given to the parser has to provide the
same interface as the queue defined in the struct module of
tcllib (struct::queue). This means, for example, that all queues
created via that module can be used here immediately. The queue
does not have to come from that module, however, as long as it
provides the same interface.
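A hedged sketch of incremental mode with a struct::queue object follows. It assumes that each queued item is an ordinary Tcl script that can be evaluated once the @win@ placeholder has been substituted; the callback ::queuedTag and the variable names are hypothetical.

```tcl
package require struct::queue
package require htmlparse

proc ::queuedTag {clientdata tag slash param text} {
    lappend ::calls "$slash$tag"
}

set ::calls {}
set q [::struct::queue]

# Incremental mode: scripts are stored in the queue, nothing runs yet.
::htmlparse::parse -cmd ::queuedTag -queue $q \
    {<ul><li>one</li><li>two</li></ul>}
set ::callsAfterParse [llength $::calls]
set ::queued [$q size]

# Running the scripts is the caller's job. Here the placeholder is
# replaced by an empty clientdata value ({}) before evaluation.
while {[$q size] > 0} {
    uplevel #0 [string map [list @win@ {{}}] [$q get]]
}
$q destroy

puts "calls during parse: $::callsAfterParse, scripts queued: $::queued"
puts "calls after draining: [llength $::calls]"
```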
In both modes the parser will return an empty string to the
caller.
The -split option may be given to a parser in incremental mode
to specify the size of the groups it creates. In other words,
-split 5 means that each of the generated scripts will invoke
the command prefix for 5 consecutive tags in the HTML string. A
parser in normal mode will ignore this option and its value.
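To illustrate the effect of -split, the sketch below parses the same input with two group sizes and compares how many scripts end up in the queue. The helper ::countScripts is hypothetical, and the exact counts depend on how the parser groups tags, so none are stated here.

```tcl
package require struct::queue
package require htmlparse

# Hypothetical helper: parse html in incremental mode and report how
# many scripts the parser queued for the given group size.
proc ::countScripts {html groupSize} {
    set q [::struct::queue]
    ::htmlparse::parse -queue $q -split $groupSize $html
    set n [$q size]
    $q destroy
    return $n
}

set html {<ul><li>a</li><li>b</li><li>c</li><li>d</li></ul>}
set ::fine   [::countScripts $html 1]
set ::coarse [::countScripts $html 100]
puts "scripts with -split 1: $::fine, with -split 100: $::coarse"
```

Smaller groups should yield more, shorter scripts, which can be useful when the scripts are executed piecemeal from an event loop.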
The option -vroot specifies a virtual root tag. A parser in
normal mode will invoke the command prefix for it immediately
before and after it processes the tags in the HTML, thus
simulating that the HTML string is enclosed in a <vroot> </vroot>
combination. In incremental mode, however, the parser is unable
to provide the closing virtual root, as it never knows when the
input is complete. In this case the first script generated by
each invocation of the parser will contain an invocation of the
command prefix for the virtual root as its first command.
The following options are available:
-cmd cmd
The command prefix to invoke for every tag in the HTML
string. Defaults to ::htmlparse::debugCallback.
-vroot tag
The virtual root tag to add around the HTML in normal
mode. In incremental mode it is the first tag in each
chunk processed by the parser, but there will be no clos-
ing tags. Defaults to hmstart.
-split n
The size of the groups produced by an incremental mode
parser. Ignored when in normal mode. Defaults to 10. Val-
ues <= 0 are not allowed.
-incvar var
The name of the variable in which to store any incomplete
HTML. This makes most sense in incremental mode. If no
variable is specified, the parser throws an error when it
sees incomplete HTML, which makes sense in normal mode.
Only incomplete tags are detected, not missing tags.
Optional; by default no variable is used.
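As an illustration of the -vroot option in normal mode, the following sketch uses a custom virtual root named document. The tag name, the callback ::recordOrder, and the variable ::order are assumptions chosen for the example.

```tcl
package require htmlparse

proc ::recordOrder {tag slash param text} {
    lappend ::order "$slash$tag"
}

set ::order {}
::htmlparse::parse -cmd ::recordOrder -vroot document {<p>text</p>}

# The virtual root is reported before the first and after the last
# real tag, as if the input were wrapped in <document> </document>.
puts $::order
```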
Interface to the command prefix
In normal mode the parser will invoke the command prefix
with four arguments appended. See ::htmlparse::debugCallback
for a description.
In incremental mode, however, the generated scripts will
invoke the command prefix with five arguments appended.
The last four of these are the same as those mentioned
above. The first is a placeholder string (@win@) for a
clientdata value to be supplied later, during the actual
execution of the generated scripts. This could be a Tk
window path, for example. This allows the user of this
package to preprocess HTML strings without committing
them to a specific window or other object during parsing;
the connection can be made later. It also means that
preprocessed HTML can be cached. Of course, nothing
prevents the user of the parser from replacing the
placeholder with an empty string.
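The caching idea can be sketched as follows: the scripts are generated once and then replayed for two different clientdata values. The window paths .left and .right, the callback ::render, and the use of string map for the substitution are assumptions; no Tk is needed, since the paths are treated as plain strings.

```tcl
package require struct::queue
package require htmlparse

proc ::render {win tag slash param text} {
    lappend ::rendered($win) "$slash$tag"
}

set q [::struct::queue]
::htmlparse::parse -cmd ::render -queue $q {<h1>Title</h1><p>Body</p>}

# Cache the generated scripts; they are not yet bound to a clientdata.
set ::cached {}
while {[$q size] > 0} {
    lappend ::cached [$q get]
}
$q destroy

# Replay the cached scripts for two hypothetical window paths.
foreach win {.left .right} {
    set ::rendered($win) {}
    foreach script $::cached {
        uplevel #0 [string map [list @win@ $win] $script]
    }
}
puts $::rendered(.left)
```

Note that a plain string map would also replace a literal @win@ occurring in the document text, so a real application may want a more careful substitution.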
::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag
This command is the standard callback used by the parser in
::htmlparse::parse if none was specified by the user. It simply
dumps its arguments to stdout. This callback can be used for
both normal and incremental mode of the calling parser. In other
words, it accepts four or five arguments. The last four
arguments are described below. The optional first argument
contains the clientdata value passed to the callback by a parser
in incremental mode. All callbacks have to follow the signature
of this command in the last four arguments, and callbacks used
in incremental parsing have to follow it in all five arguments.
The first argument, clientdata, is optional and present only if
this command is invoked by a parser in incremental mode. It con-
tains whatever the user of this package wishes.
The second argument, tag, contains the name of the tag which is
currently processed by the parser.
The third argument, slash, is either empty or contains a slash
character. It allows the callback to distinguish between opening
(slash is empty) and closing tags (slash contains a slash char-
acter).
The fourth argument, param, contains the un-interpreted list of
parameters to the tag.
The fifth and last argument, textBehindTheTag, contains the text
found by the parser behind the tag named in tag.
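A callback following this signature might look like the sketch below, which accepts either four or five arguments. The name ::tagLogger and the event format are hypothetical; lassign requires Tcl 8.5 or later.

```tcl
package require htmlparse

# Hypothetical callback usable in both modes: four arguments in normal
# mode, five (with a leading clientdata) in incremental mode.
proc ::tagLogger {args} {
    if {[llength $args] == 5} {
        lassign $args clientdata tag slash param text
    } else {
        set clientdata {}
        lassign $args tag slash param text
    }
    # An empty slash marks an opening tag, "/" a closing tag.
    set kind [expr {$slash eq "/" ? "close" : "open"}]
    lappend ::events [list $kind $tag $param $text]
}

set ::events {}
::htmlparse::parse -cmd ::tagLogger {<a href="x.html">link text</a>}
foreach event $::events {
    puts $event
}
```

The param value arrives as one uninterpreted string, so splitting it into attribute names and values is left to the callback.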
::htmlparse::mapEscapes html
This command takes an HTML string, substitutes all escape
sequences with their actual characters, and then returns the
resulting string. HTML strings which do not contain escape
sequences are returned unchanged.
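A short example of the substitution, using only the common entities for the ampersand, less-than and greater-than characters:

```tcl
package require htmlparse

set raw   {a &lt; b &amp;&amp; c &gt; d}
set plain [::htmlparse::mapEscapes $raw]
puts $plain

# Text without escape sequences comes back unchanged.
set same [::htmlparse::mapEscapes {no escapes here}]
puts $same
```

A callback would typically apply this command to the text argument it receives before displaying or storing it.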
::htmlparse::2tree html tree
This command is a wrapper around ::htmlparse::parse which takes
an HTML string (in html) and converts it into a tree containing
the logical structure of the parsed document. The name of the
tree is given to the command as its second argument (tree). The
command does not create the tree itself but expects the caller
to provide an existing, empty tree. It also expects the
specified tree object to follow the same interface as the tree
object in the struct module of tcllib (struct::tree). The tree
does not have to come from that module, but it must provide the
same interface.
The internal callback does some basic checking of HTML validity
and tries to recover from the most basic errors. The command
returns the contents of its second argument, i.e. the name of
the tree. Its side effect is the population and manipulation of
that tree object.
Each node in the generated tree represents one tag in the input.
The name of the tag is stored in the attribute type of the node.
Any HTML attributes coming with the tag are stored unmodified in
the attribute data of the node. In other words, the command does
not parse HTML attributes into their names and values.
If a tag contains text, its node will have children of type
PCDATA containing this text. The text will be stored in the
attribute data of these children.
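The tree layout described above can be inspected with a sketch like the following, which assumes a struct::tree object from tcllib (version 2 interface) and walks it to collect node types and PCDATA text. The variable names are illustrative.

```tcl
package require struct::tree
package require htmlparse

set t [::struct::tree]
::htmlparse::2tree \
    {<html><body><p>Hello <b>world</b></p></body></html>} $t

# Collect the type of every node and the text of every PCDATA node.
set ::types {}
set ::texts {}
$t walk root -order pre node {
    set type [$t get $node type]
    lappend ::types $type
    if {$type eq "PCDATA"} {
        lappend ::texts [$t get $node data]
    }
}
puts "node types: $::types"
puts "text nodes: $::texts"
$t destroy
```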
::htmlparse::removeVisualFluff tree
This command walks a tree as generated by ::htmlparse::2tree and
removes all the nodes which represent visual tags and not struc-
tural ones. The purpose of the command is to make the tree eas-
ier to navigate without getting bogged down in visual informa-
tion not relevant to the search. Its only argument is the name
of the tree to cut down.
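The following sketch compares the node types before and after the call. Which tags count as visual is decided by the package; the example assumes that b and i are among them. The helper ::nodeTypes is hypothetical.

```tcl
package require struct::tree
package require htmlparse

# Hypothetical helper: list the type of every node in the tree.
proc ::nodeTypes {t} {
    set types {}
    $t walk root -order pre node {
        lappend types [$t get $node type]
    }
    return $types
}

set t [::struct::tree]
::htmlparse::2tree \
    {<body><p>Some <b>bold</b> and <i>italic</i> text</p></body>} $t

set ::before [::nodeTypes $t]
::htmlparse::removeVisualFluff $t
set ::after [::nodeTypes $t]

puts "before: $::before"
puts "after:  $::after"
$t destroy
```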
::htmlparse::removeFormDefs tree
Like ::htmlparse::removeVisualFluff this command is here to cut
down on the size of the tree as generated by ::htmlparse::2tree.
It removes all nodes representing forms and form elements. Its
only argument is the name of the tree to cut down.
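A similar sketch for form removal, again with a hypothetical helper ::typesOf and made-up input:

```tcl
package require struct::tree
package require htmlparse

# Hypothetical helper: list the type of every node in the tree.
proc ::typesOf {t} {
    set types {}
    $t walk root -order pre node {
        lappend types [$t get $node type]
    }
    return $types
}

set t [::struct::tree]
::htmlparse::2tree {<body><p>Search</p><form action="q.cgi"><input type="text" name="q"></form></body>} $t

::htmlparse::removeFormDefs $t
set ::remaining [::typesOf $t]

# No node of type form should be left; the paragraph stays.
puts $::remaining
$t destroy
```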
BUGS, IDEAS, FEEDBACK
This document, and the package it describes, will undoubtedly contain
bugs and other problems. Please report such in the category htmlparse
of the Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please
also report any ideas for enhancements you may have for either package
and/or documentation.
When proposing code changes, please provide unified diffs, i.e. the
output of diff -u.
Note further that attachments are strongly preferred over inlined
patches. Attachments can be made by going to the Edit form of the
ticket immediately after its creation, and then using the left-most
button in the secondary navigation bar.
SEE ALSO
struct::tree
KEYWORDS
html, parsing, queue, tree
CATEGORY
Text processing
tcllib 1.2.2 htmlparse(3tcl)