xsxp(3tcl) Amazon S3 Web Service Utilities xsxp(3tcl)
______________________________________________________________________________
NAME
xsxp - eXtremely Simple Xml Parser
SYNOPSIS
package require Tcl 8.4
package require xsxp 1
package require xml
xsxp::parse xml
xsxp::fetch pxml path ?part?
xsxp::fetchall pxml_list path ?part?
xsxp::only pxml tagname
xsxp::prettyprint pxml ?chan?
______________________________________________________________________________
DESCRIPTION
This package provides a simple interface to parse XML into a pure-value
list. It also provides accessor routines to pull out specific subtags,
not unlike DOM access. This package was written for and is used by
Darren New's Amazon S3 access package.
The implementation is admittedly crude, but I needed something like
this for S3 and, at the time, TclDOM would not work with the then-new
Tcl 8.5 due to version number problems.
In addition, this is a pure-value implementation. There is no garbage
to clean up in the event of a thrown error, for example. This simplifies
the code for sufficiently small XML documents, which is what Amazon's
S3 guarantees.
Copyright 2006 Darren New. All Rights Reserved. NO WARRANTIES OF ANY
TYPE ARE PROVIDED. COPYING OR USE INDEMNIFIES THE AUTHOR IN ALL WAYS.
This software is licensed under essentially the same terms as Tcl. See
LICENSE.txt for the terms.
COMMANDS
The package implements five rather simple procedures. One parses, one
is for debugging, and the rest pull various parts of the parsed document
out for processing.
xsxp::parse xml
This parses an XML document (using the standard xml tcllib module
in a SAX sort of way) and builds a data structure which it
returns if the parsing succeeded. The return value is referred
to herein as a "pxml", or "parsed xml". The list consists of two
or more elements:
o The first element is the name of the tag.
o The second element is an array-get formatted list of
key/value pairs. The keys are attribute names and the
values are attribute values. This is an empty list if
there are no attributes on the tag.
o The third through end elements are the children of the
node, if any. Each child is, recursively, a pxml.
o Note that if the first element (index 0), i.e. the tag name,
is "%PCDATA", then the attributes will be empty and the
third element will be the text of the element. In addition,
if an element's contents consist only of PCDATA,
it will have only one child, and all the PCDATA will be
concatenated. In other words, this parser works poorly
for XML with elements that contain both child tags and
PCDATA. Since Amazon S3 does not do this (and for that
matter most uses of XML where XML is a poor choice don't
do this), this is probably not a serious limitation.
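As a sketch of the structure described above, the following builds by
hand the value that the description implies for a small document such
as <a x="1"><b>hi</b></a> and takes it apart with ordinary list
commands. The document and the variable names are illustrative
assumptions; no xsxp command is called here.

```tcl
# Hand-written pxml for the hypothetical document <a x="1"><b>hi</b></a>,
# following the layout described above: tag, attribute list, children.
set pxml {a {x 1} {b {} {%PCDATA {} hi}}}

set tag      [lindex $pxml 0]        ;# the tag name
set attrs    [lindex $pxml 1]        ;# array-get style key/value list
set children [lrange $pxml 2 end]    ;# list of child pxmls (here just <b>)

# The attribute list loads directly into an array.
array set attr $attrs
puts "tag=$tag x=$attr(x)"           ;# prints: tag=a x=1

# The only child of <b> is a %PCDATA node whose third element is the text.
set b    [lindex $children 0]
set text [lindex $b 2 2]
puts "text=$text"                    ;# prints: text=hi
```

Because a pxml is an ordinary Tcl list, it can be copied, stored, or
discarded like any other value.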
xsxp::fetch pxml path ?part?
pxml is a parsed XML, as returned from xsxp::parse. path is a
list of element tag names. Each element is the name of a child
to look up, optionally followed by a hash ("#") and a string of
digits. An empty list or an initial empty element selects pxml.
If no hash sign is present, the behavior is as if "#0" had been
appended to that element. (In addition to a list, slashes can
separate subparts where convenient.)
An element of path scans the children at the indicated level for
the n'th instance (counting from zero, where n is the number after
the hash sign) of a child whose tag matches the part of the
element before the hash sign. If an element is simply "#" followed
by digits, that indexed child is selected, regardless of
the tags in the children. Hence, an element of "#3" will always
select the fourth child of the node under consideration.
part defaults to "%ALL". It can be one of the following
case-sensitive terms:
%ALL returns the entire selected element.
%TAGNAME
returns lindex 0 of the selected element.
%ATTRIBUTES
returns lindex 1 of the selected element.
%CHILDREN
returns lrange 2 through end of the selected element,
resulting in a list of elements being returned.
%PCDATA
returns a concatenation of all the bodies of direct children
of this node whose tag is %PCDATA. It throws an error
if no such children are found. That is, part=%PCDATA
means return the textual content found in that node but
not in its child nodes.
%PCDATA?
is like %PCDATA, but returns an empty string if no PCDATA
is found.
For example, to fetch the first bold text from the fifth paragraph of
the body of your HTML file,
xsxp::fetch $pxml {body p#4 b} %PCDATA
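The path and part rules above can be sketched against a small
hand-written pxml. The data below is hypothetical and stands in for
the result of xsxp::parse; only xsxp::fetch itself is taken from this
package.

```tcl
package require xsxp

# Hypothetical pxml: a <list> with two <item> children and an empty <note>.
set pxml {list {}
    {item {id a} {%PCDATA {} one}}
    {item {id b} {%PCDATA {} two}}
    {note {}}
}

# "item#1" picks the second <item>; a bare "item" behaves like "item#0".
set second [xsxp::fetch $pxml item#1 %PCDATA]
set attrs  [xsxp::fetch $pxml item %ATTRIBUTES]

# "#2" picks the third child regardless of its tag.
set third  [xsxp::fetch $pxml #2 %TAGNAME]

# <note> holds no PCDATA: %PCDATA? yields an empty string, %PCDATA throws.
set maybe  [xsxp::fetch $pxml note %PCDATA?]
set failed [catch {xsxp::fetch $pxml note %PCDATA} msg]

puts [list $second $attrs $third $maybe $failed]
```

Using %PCDATA? avoids wrapping the call in catch when an element may
legitimately be empty.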
xsxp::fetchall pxml_list path ?part?
This iterates over each pxml in pxml_list (which must be a list
of pxmls) selecting the indicated path from it, building a new
list with the selected data, and returning that new list.
For example, pxml_list might be the %CHILDREN of a particular
element, and the path and part might select from each child a
sub-element in which we're interested.
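A minimal sketch of that pattern, assuming a hypothetical hand-written
pxml in which every child carries a <name> sub-element:

```tcl
package require xsxp

# Hypothetical pxml: two <bucket> children, each with a <name> element.
set pxml {buckets {}
    {bucket {} {name {} {%PCDATA {} alpha}}}
    {bucket {} {name {} {%PCDATA {} beta}}}
}

# An empty path selects pxml itself; %CHILDREN returns its child list.
set children [xsxp::fetch $pxml {} %CHILDREN]

# Pull the text of <name> out of every child in one call.
set names [xsxp::fetchall $children name %PCDATA]
puts $names
```

The result has one entry per element of pxml_list, in the same order.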
xsxp::only pxml tagname
This iterates over the direct children of pxml and selects only
those with tagname as their tag. Returns a list of matching
elements.
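As an illustration, the sketch below filters the children of a
hypothetical hand-written pxml by tag and then reads the attributes of
each match:

```tcl
package require xsxp

# Hypothetical pxml: a <dir> holding two <file> children and one <link>.
set pxml {dir {}
    {file {name a.txt}}
    {link {name b}}
    {file {name c.txt}}
}

# Keep only the direct children whose tag is "file".
set files [xsxp::only $pxml file]

# Each match is itself a pxml, so the other accessors apply to it.
foreach f $files {
    puts [xsxp::fetch $f {} %ATTRIBUTES]
}
```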
xsxp::prettyprint pxml ?chan?
This outputs to chan (default stdout) a pretty-printed version
of pxml.
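A short sketch of both call forms, using a hypothetical hand-written
pxml; the exact layout of the printed output is not specified here:

```tcl
package require xsxp

# Hypothetical pxml for <a x="1"><b>hi</b></a>.
set pxml {a {x 1} {b {} {%PCDATA {} hi}}}

xsxp::prettyprint $pxml           ;# writes to stdout by default
xsxp::prettyprint $pxml stderr    ;# or to any open, writable channel
```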
BUGS, IDEAS, FEEDBACK
This document, and the package it describes, will undoubtedly contain
bugs and other problems. Please report such in the category amazon-s3
of the Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please
also report any ideas for enhancements you may have for either package
and/or documentation.
When proposing code changes, please provide unified diffs, i.e. the
output of diff -u.
Note further that attachments are strongly preferred over inlined
patches. Attachments can be made by going to the Edit form of the
ticket immediately after its creation, and then using the left-most
button in the secondary navigation bar.
KEYWORDS
dom, parser, xml
CATEGORY
Text processing
COPYRIGHT
2006 Darren New. All Rights Reserved.
tcllib 1.0 xsxp(3tcl)