XMLDECLARATION: integer indicates begin of document
XMLMODE: integer for switching on XML processing
XMLSTARTELEM: string holds tag upon entering element
XMLATTR: array holds attribute names and values
XMLENDELEM: string holds tag upon leaving element
XMLCHARDATA: string holds character data
XMLPROCINST: string holds processing instruction target
XMLCOMMENT: string holds comment
XMLSTARTCDATA: integer indicates begin of CDATA
XMLENDCDATA: integer indicates end of CDATA
LANG: env variable holds default character encoding
XMLCHARSET: string holds current character set
XMLSTARTDOCT: root tag name indicates begin of DTD
XMLENDDOCT: integer indicates end of DTD
XMLUNPARSED: string holds unparsed characters
XMLERROR: string holds textual error description
XMLROW: integer holds current row of parsed item
XMLCOL: integer holds current column of parsed item
XMLLEN: integer holds length of parsed item
XMLDEPTH: integer holds nesting depth of elements
XMLPATH: string holds nested tags of parsed elements
XMLENDDOCUMENT: integer indicates end of XML data
XMLEVENT: string holds name of event
XMLNAME: string holds name assigned to XMLEVENT
This file documents the XML processing extension in GNU Awk (gawk) version 3.2 and later.
This is Edition 0.3 of XML Processing With gawk, for the 3.1.6 (or later) version of the GNU implementation of AWK.
Copyright (C) 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being “GNU General Public License”, the Front-Cover texts being (a) (see below), and with the Back-Cover Texts being (b) (see below). A copy of the license is included in the section entitled “GNU Free Documentation License”.
In June of 2003, I was confronted with some textual configuration files in XML format and I was scared by the fact that my favorite tools (grep and awk) turned out to be mostly useless for extracting information from these files. It looked as if AWK's way of processing files line by line had to be replaced by a node-traversal of tree-like XML data. For the first implementation of an extended gawk, I chose the expat library to help me reading XML files. With a little help from Stefan Tramm I went on selecting features and implemented what is now called XMLgawk over the Christmas Holidays 2003. In June 2004, Manuel Collado joined us and started collecting his comments and proposals for the extension. Manuel also wrote a library for reading XML files into a DOM-like structure. In Septermber 2004, I wrote the first version of this web page. Andrew Schorr flooded my mailbox with patches and suggestions for changes. His initiative pushed me into starting the SourceForge project. This happened in March 2005 and since then, all software changes go into a CVS source tree at SourceForge (thanks to them for providing this service). Andrew's urgent need for a production system drove development in early 2005. Significant changes were made:
XMLEVENT and XMLNAME.
-l option and @load were introduced.
Makefile.am etc.)
and found an installation mechanism which allows an easy
and collision-free installation in the same directory as Arnold's
GNU Awk.
He also made Arnold's igawk obsolete by implementing the
-i option. April 2005 saw the Alpha release of xgawk,
as a branch of gawk-3.1.4.
In August 2005, Hirofumi Saito held a presentation at the
Lightweight Language Day and Night
in Japan. His little slideshow demonstrated the importance
of multibyte characters in any modern programming language.
Hirofumi Saito also did the localization of our source code for the
Japanese language.
Kimura Koichi reported and fixed some problems in handling of
multibyte characters. He also found ways to get all this running
on several flavours of Microsoft Windows.
Meanwhile in Summer 2005, Arnold had released gawk-3.1.5 and
I applied all his 219 patches to our CVS tree over the Christmas
Holidays 2005. Andrew applied some more bug fixes from the GNU mailing
archive and so the current Beta release of xgawk-3.1.5 is
already a bit ahead of Arnold's gawk-3.1.5.
Jürgen Kahrs
In August 2006, Arnold and friends set up a mirror of Arnold's
source tree as a CVS repository at
Savannah.
It is now much easier for us to understand recent changes in
Arnold's source tree. We strive to merge all of them immediately
to our source tree. This merge process has been enormously simplified
by a weekly cron job mechanism (implemented by Andrew)
that examines recent activities in Arnold's tree and sends an
email to our mailing list.
Some more problems and fixes in handling multibyte characters have been reported by our Japanese friends to Arnold and us. For example, Hirofumi Saito and others forwarded patches for the half-width katakana characters in character classes in ShiftJIS locale.
Since January 2007, there is a new target valgrind in
the top level Makefile. This feature was implemented
for detection of memory leaks in the interpreter while running
the regression test cases. We found small memory leaks in our
time and mpfr extension instantly with this new
feature.
March 2007 saw much activity. First we introduced Victor Paesa's new extension for the GD library. Then we merged Paul Eggert's file floatcomp.c (floating point / long comparison) from Arnold's source tree. We also merged the changes in regression test cases and documentation due to changed behaviour in numerical calculations (infinity and not a number) and formatting of these.
Stefan Tramm held a 5-minute Lightning Talk on xgawk at OSCON 06.
Hirofumi Saito took part in the Lightweight Language Conference 2006. Matt Rosin has summarized the event in English.
The new Reading XML Data with POSIX AWK describes a template
script getXMLEVENT.awk that allows us to write portable
scripts in a manner that is a mostly compatible subset of the
XMLgawk API. Such scripts can be run on any POSIX-compliant AWK
interpreter – not just xgawk.
The new Copying and Modifying with the xmlcopy.awk library script describes a library script for making slightly modified copies of XML data.
Thanks to Andrew's mechanism for systematic reporting of patches
applied by Arnold to his gawk-stable tree, Andrew and I
caught up with recent changes in Arnold's source tree.
As a consequence, xgawk is now based upon the recent official
gawk-3.1.6 release.
Jürgen Kahrs
FIXME: This document has not been completed yet. The incomplete portions have been marked with comments like this one.
FIXME: The scope of this document has to be decided upon. Is this a document about the XMLgawk extension only or is this a document about all extensions of GNU Awk ? Andrew has inserted descriptions of the PostgreSQL and the time extension in appendices.
This chapter provides a (necessarily) brief intoduction to XML concepts. For many applications of gawk XML processing, we hope that this is enough. For more advanced tasks, you will need deeper background, and it may be necessary to switch to other tools like XSL processors
But before we look at XML, let us first reiterate how AWK's program execution works and what to expect from XML processing within this framework. The gawk man page summarizes AWK's basic execution model as follows:
An AWK program consists of a sequence of pattern-action statements and optional function definitions.
pattern { action statements }
function name(parameter list) { statements }
...For each record in the input, gawk tests to see if it matches any pattern in the AWK program. For each pattern that the record matches, the associated action is executed. The patterns are tested in the order they occur in the program. Finally, after all the input is exhausted, gawk executes the code in the END block(s) (if any).
A look at a short and simple example will reveal the strength of this abstract description. The following script implements the Unix tool wc (well, almost, but not completely).
BEGIN { words=0 }
{ words+=NF }
END { print NR, words }
Before opening the file to be processed, the word counter is initialized with 0. Then the file is opened and for each line the number of fields (which equals the number of words) is added to the current word counter. After reading all lines of the file, the resulting word counter is printed as well as the number of lines.
Store the lines above in a file named wc.awk and invoke it with
gawk -f wc.awk datafile.xml
This kind of invocation will work on all platforms. In a Unix environment (or in the Cygwin Unix-emulation on top of Microsoft Windows) it is more comfortable to store the script above into an executable file. To do so, write a file named wc.awk, with the first line being
#!/usr/bin/gawk -f
followed by the lines above. Then make the file wc.wk executable with
chmod a+x wc.awk
and invoke it as
wc.awk datafile.xml
When looking at fig:awk_proc from top to bottom, you will recognize that each line of the data file is represented by a row in the figure. In each row you see NR (the number of the current line) on the left and the pattern (the condition for execution) and its action on the right. The first and last rows represent BEGIN (initialization) and END (finalization).
We could use this script to process any XML file. But the result it yielded would not be too meaningful to us. When processing XML files, you are not really interested in the number of lines or words. Take, for example, this XML file, a DocBook file to be precise.
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"/afs/cern.ch/sw/XML/XMLBIN/share/www.oasis-open.org/docbook/xmldtd-4.2/docbookx.dtd"
>
<book id="hello-world" lang="en">
<bookinfo>
<title>Hello, world</title>
</bookinfo>
<chapter id="introduction">
<title>Introduction</title>
<para>This is the introduction. It has two sections</para>
<sect1 id="about-this-book">
<title>About this book</title>
<para>This is my first DocBook file.</para>
</sect1>
<sect1 id="work-in-progress">
<title>Warning</title>
<para>This is still under construction.</para>
</sect1>
</chapter>
</book>
Figure 1.2: Example of some XML data (DocBook file)
Reading through this jungle of angle brackets, you will notice that
the notion of a line is not an adequate concept to describe what
you see. AWK's idea of records and fields only makes sense
in a rectangular world of textual data being stored in rows and columns.
This notion is blind to XML's notion of structuring textual data into
markup blocks (like <title>Introduction</title>), with beginning
and ending being marked as such by angle brackets. Furthermore,
XML's markup blocks can contain other blocks (like a chapter contains
a title and a para). XML sees textual data as
a tree with deeply nested nodes (markup blocks). A tree is a dynamic data structure;
some people call it a recursive structure because a tree contains
other trees, which may contain even other trees. These sub-trees are not
numbered (as rows and columns) but they have names. Now that we have a coarse
understanding of the structure of an XML file, we can choose an adequate way
of picturing the situation. XML data has a tree structure, so let's draw the
example file in Figure 1.2 above as a tree (see fig:docbook_chapter).
You can easily see that each markup block is drawn as a node in this tree. The edges in the tree reveal the nesting of the markup blocks in a much more lucid way than the textual representation. Each edge indicates that the markup block which has an arrow pointing to it, is contained in the markup block from which which the edge comes. Such edges indicate the "parent-child" relationship.
Now, what could be the equivalent of a wc command when dealing with such trees of markup blocks ? We could count the nodes of the tree. You can store and invoke the following script in the same way as you did for the previous script.
BEGIN { nodes = 0 }
XMLSTARTELEM { nodes ++ }
END { print nodes }
If you invoke this script with the data file in Figure 1.2, the number of nodes will be printed immediately:
gawk -l xml -f node_count.awk dbfile.xml
12
Notice the similarity between this example script and the original wc.awk which counts words. Instead of going over the lines, this script traverses the tree and increments the node counter each time a node is found. After a closer look you will find several differences between the previous script and the present one:
-l xml.
This is necessary for loading the XML extension into the gawk
interpreter so that the
gawk interpreter knows that the file to be opened is an
XML file and has to be treated differently.
XMLSTARTELEM pattern.
bookinfo node and the chapter node are both positioned
directly under the book node; but which is counted first ?
The answer becomes clear when we return to the textual representation
of the tree — textual order induces traversal order.
Do you see the numbers near the arrow heads ? These are the numbers indicating traversal order. The number 1 is missing because it is clear that the root node (framed with a bold line) is visited first. Computer Scientists call this traversal order depth-first because at each node, its children (the deeper nodes) are visited before going on with nodes at the same level. There are other orders of traversal ( breadth-first ) but the textual order in Figure 1.2 enforces the numbering in Figure 1.3.
The tree in Figure 1.3 is not balanced. The very last nodes are nested so deep that they are printed on the very right of the margin in Figure 1.3. This is not the case for the upper part of the drawing. Sometimes it is useful to know the maximum depth of such a tree. The following script traverses all nodes and at each node it compares actual depth and maximum depth to find and remember the largest depth.
@load xml
XMLSTARTELEM {
depth++
if (depth > max_depth)
max_depth = depth
}
XMLENDELEM { depth-- }
END { print max_depth }
Figure 1.4: Finding the maximum depth of the tree representation of an XML file with the script max_depth.awk
If you compare this script to the previous one, you will again notice some subtle differences.
-l xml on the
command line. If the source text of your script is stored in an
executable file, you should start the script with loading all
extensions into the interpreter. The command line option -l xml
should only be used as a shorthand notation when you are working with
a one-line command line.
XMLENDELEM. This is the counterpart of the pattern
XMLSTARTELEM. One is true upon entering a node,
the other is true upon leaving the node. In the textual
representation, these patterns mark the beginning and the
ending of a markup block. Each time the script enters a
markup block, the depth counter is increased and
each time a markup block is left, the depth counter
is decreased.
Later we will learn that this script can be shortened even more by using the builtin variable XMLDEPTH which contains the nesting depth of markup blocks at any point in time. With the use of this variable, the script in Figure 1.4 becomes one of these one-liners which are so typical for daily work with gawk.
If you already know the basics of XML terminology, you can skip this section and advance to the next chapter. Otherwise, we recommend studying the O'Reilly book XML in a Nutshell, which is a good combination of tutorial and reference. Basic terminology can be found in chapter 2 (XML Fundamentals). If you prefer (free) online tutorials, then we recommend w3schools. See Links to the Internet, for additional valuable material.
Before going on reading, you should make sure you know the meaning of the following terms. Instead of leaving you on your own with learning these terms, we will give an informal and insufficient explanation of each of the terms. Always refer to Figure 1.2 for an example and consider looking the term up in one of the sources given above.
lang) and a value (en)
bookinfo including title
<?xml version="1.0" encoding="ISO-8859-1"?>
Still reading ? Be warned that these definitions are formally incorrect.
They are meant to get you on the right track.
Each ambitious propeller head will happily tear these definitions apart.
If you are seriously striving to become an XML propeller head yourself,
then you should not miss reading the original defining documents about the
XML technology.
The proper playing ground for anxious aspirants is the newsgroup
comp.text.xml.
I am glad none of those propeller heads reads gawk web pages — they would kill me.
Some users will try to avoid the use of the new language features described earlier. They want to write portable scripts; they have to refrain from using features which are not part of the standardized POSIX AWK. Since the XML extension of GNU Awk is not part of the POSIX standard, these users have to find different ways of reading XML data.
Implementing a complete XML reader in POSIX AWK would mean that all subtle details of
Unicode encodings had to be handled. It doesn't make sense to go into
such details with an AWK script. But in 2001, Steve Coile wrote a parser
which is good enough if your XML data consists of simple tagged blocks
of ASCII characters. His script is available on the Internet as
xmlparse.awk.
The source code of xmlparse.awk is well documented and ready-to-use
for anyone having access to the Internet.
Begin your exploration of xmlparse.awk by downloading it.
As of Summer 2007, there is a typo in the file that has to be
corrected before you can start to work with the parser. Insert
a hashmark character (#) in front of the comment in line
342.
wget ftp://ftp.freefriends.org/arnold/Awkstuff/xmlparser.awk
vi xmlparser.awk
342G
i#
ESC
:wq
While you're editing the parser script, have a look at the comments.
This is a well-documented script that explains its implementation
as well as some use cases. For example, the header summarizes almost
all details that a user will need to remember (see fig:ch2_xmlparser_header).
There is a negligible inconsistency in the header: The file is really
named xmlparser.awk and not xmlparse.awk as stated in the
header. From a user's perspective, the most important constraint to
keep in mind is that this XML parser needs a modern variant of
AWK. This means a POSIX compliant AWK; the old Solaris implentation
oawk will not be able to interpret this XML parser script as
intended. Invoke the XML parser for the first time with
awk -f xmlparser.awk docbook_chapter.xml
Compare the output to the original file's content (see Figure 1.2) and its depiction as a tree (see Figure 1.3). You will notice that the first column of the output always contains the type of the items as they were parsed sequentially:
pi xml version="1.0" encoding="UTF-8"
data \n\n
decl DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"\n "/afs/cern.ch/sw/XML/XMLBIN/share/www.oasis-open.org/docbook/xmldtd-4.2/docbookx.dtd"\n
data \n
begin BOOK
attrib id
value hello-world
attrib lang
value en
data \n \n
begin BOOKINFO
data \n
begin TITLE
data Hello, world
end TITLE
This is in accordance with the guiding principles explained in the header of the parser script. Note that the description in fig:ch2_xmlparser_header is incomplete. More details will be provided below.
The script parses the XML data and saves each parsed item in two arrays:
type[3] indicates the type of the 3rd parsed XML data item.
This may be any of
item[3] contains the text
of the error message.
type[3] containing "begin" or "end",
item[3] contains the name of the tag.
type[3] containing "attrib" or "value",
item[3] contains the attribute's name or value.
item[3] contains data depending on type[3], as
distinguished in the item list above.
While you proceed reading this web page, you will notice that
the basic idea is similar to what XMLgawk does. Especially the approach
described in XMLgawk Core Language Interface Summary as
Style B - Reduced set of variables shared by all events
will look familiar. The script as it is was not designed to be a
modular building block. Any application will not simply include
the xmlparser.awk file, but copy it textually and modify
the copy. Look into the original script once more and have a
closer look at the final END pattern. You will find suggestions
for several useful applications inside the END pattern.
##############################################################################
#
# xmlparse.awk - A simple XML parser for awk
#
# Author: Steve Coile <scoile@csc.com>
#
# Version: 1.0 (20011017)
#
# Synopsis:
#
# awk -f xmlparse.awk [FILESPEC]...
#
# Description:
#
# This script is a simple XML parser for (modern variants of) awk.
# Input in XML format is saved to two arrays, "type" and "item".
#
# The term, "item", as used here, refers to a distinct XML element,
# such as a tag, an attribute name, an attribute value, or data.
#
# The indexes into the arrays are the sequence number that a
# particular item was encountered. For example, the third item's
# type is described by type[3], and its value is stored in item[3].
#
# The "type" array contains the type of the item encountered for
# each sequence number. Types are expressed as a single word:
# "error" (invalid item or other error), "begin" (open tag),
# "attrib" (attribute name), "value" (attribute value), "end"
# (close tag), and "data" (data between tags).
#
# The "item" array contains the value of the item encountered
# for each sequence number. For types "begin" and "end", the
# item value is the name of the tag. For "error", the value is
# the text of the error message. For "attrib", the value is the
# attribute name. For "value", the value is the attribute value.
# For "data", the value is the raw data.
#
# WARNING: XML-quoted values ("entities") in the data and attribute
# values are *NOT* unquoted; they are stored as-is.
#
###############################################################################
Figure 2.1: Usage explained in the header of xmlparser.awk
if (type[idx] == "error") {
...
}
it is quite easy to implement a script that checks for well-formedness of some XML data.
END pattern.
It demonstrates how to go through all parsed items sequentially
and handle each of the types appropriately.
for ( n = 1; n <= idx; n += 1 ) {
if ( type[n] == "attrib" ) {
} else if ( type[n] == "begin" ) {
} else if ( type[n] == "cdata" ) {
} else if ( type[n] == "comment" ) {
} else if ( type[n] == "data" ) {
} else if ( type[n] == "decl" ) {
} else if ( type[n] == "end" ) {
} else if ( type[n] == "error" ) {
} else if ( type[n] == "pi" ) {
} else if ( type[n] == "value" ) {
}
}
xmlparser.awk script. Notice that this is done
after the complete XML data has been read. So, at the moment of
processing, the complete XML data is somehow saved in AWK's memory,
imposing some limit on the size of the data that can be processed.
XMLDEPTH=0
for (n = 1; n <= idx; n += 1 ) {
if ( type[n] == "attrib") { printf(" %s=", item[n] )
} else if (type[n] == "value" ) { printf("'%s'", item[n] )
} else if (type[n] == "begin" ) { printf("\n%*s%s", 2*XMLDEPTH,"", item[n])
XMLDEPTH ++
} else if (type[n] == "end" ) { XMLDEPTH -- }
}
If you compare the output of this application with fig:ch2_dbfile
you will notice only two differences. The first is the newline character
before the very first tag; the second is the names of the tags.
The xmlparser.awk script saves the names of the tags in
uppercase letters, the exact tag name cannot be revovered without
changing the internals of the XML parsing mechanism.
In 2005, Jan Weber posted a similar XML parser to the newsgroup
comp.lang.awk.
You can use Google to search for the script getXML and
copy it into a file. Unfortunately, Jan tried to make the script
as short as possible and often put several statements on one line.
Readability of the script has suffered severely and if you intend
to analyse the script, be prepared that some editing may be
necessary to understand it. Again, while you're editing the parser
script, have a look at the comments. Jan has commented the one
central function of the script from a user's perspective as
follows (see fig:ch2_getXML_header). The basic approach was
taken over from the xmlparser.awk script. But there were
several constraints Jan tried to satisfy in writing his XML parser:
getXML allows to read multiple XML files in parallel.
getXML function, similar to the getline
mechanism of AWK (see fig:ch2_getXML). Furthermore, the
user application reads files in the BEGIN action of
the AWK script, not in the END action.
xmlparser.awk script stored the complete XML
data into two arrays during the parsing process, getXML.awk
passes one XML event at a time back to the calling application,
avoiding the unwanted waste of memory. This means, parsing large
XML files becomes possible (although it doesn't make too much sense).
nawk implementation of the
AWK language that comes with the Solaris Operating System.
As a consequence, this XML parser is probably the most portable
of all parsers described in this web page.
Again, we will demonstrate the usage of this XML parser by
implementing an outline script like the one in fig:ch2_outline.awk.
Change the file getXML and replace the existing BEGIN
action with the script in fig:ch2_getXML.
Invoke the new outline parser for the first time with
awk -f getXML docbook_chapter.xml
Compare the output to the original file's content (see Figure 1.2),
its depiction as a tree (see Figure 1.3) and to the output
of the original outline tool that comes with the expat
parser (see fig:ch2_dbfile).
The result is almost identical to fig:ch2_dbfile, except for
one minor detail: The very first line is a blank line here.
##
# getXML( file, skipData ): # read next xml-data into XTYPE,XITEM,XATTR
# Parameters:
# file -- path to xml file
# skipData -- flag: do not read "DAT" (data between tags) sections
# External variables:
# XTYPE -- type of item read, e.g. "TAG"(tag), "END"(end tag), "COM"(comment), "DAT"(data)
# XITEM -- value of item, e.g. tagname if type is "TAG" or "END"
# XATTR -- Map of attributes, only set if XTYPE=="TAG"
# XPATH -- Path to current tag, e.g. /TopLevelTag/SubTag1/SubTag2
# XLINE -- current line number in input file
# XNODE -- XTYPE, XITEM, XATTR combined into a single string
# XERROR -- error text, set on parse error
# Returns:
# 1 on successful read: XTYPE, XITEM, XATTR are set accordingly
# "" at end of file or parse error, XERROR is set on error
# Private Data:
# _XMLIO -- buffer, XLINE, XPATH for open files
##
Figure 2.2: Usage of Jan Weber's getXML parser function
But some implementation details are noteworthy. Here, granularity of items
is different: All attributes are reported along with their tag item.
This results from a design decision: The getXML function
uses several variables to pass larger amounts of data back to the
caller. Finally a detail that did not become so obvious in this example.
Notice the second parameter of the getXML function (skipData).
Jan introduced an option that allows skipping textual data in
between tags (mixed content).
#!/usr/bin/nawk -f
BEGIN {
XMLDEPTH=0
while ( getXML(ARGV[1],1) ) {
if ( XTYPE == "TAG" ) {
printf("\n%*s%s", 2*XMLDEPTH, "", XITEM)
XMLDEPTH++
for (attrName in XATTR)
printf(" %s='%s'", attrName, XATTR[attrName])
} else if ( XTYPE == "END" ) {
XMLDEPTH--
}
}
}
Figure 2.3: Outlining an XML file with Jan Weber's getXML parser
Jan Webers's portable script in the previous section was a significant advance over Steve Coile's script. Handling of XML events feels much more like it does in the XMLgawk API. But after some time of working with the script, the differences between it and the XMLgawk API become a bit annoying to remember. As a consequence, we took Jan's script, copied it into a new script file getXMLEVENT.awk and changed its inner working so as to minimize differences to the XMLgawk API. If you intend to use the script as a template for your own work, search for the file getXMLEVENT.awk in the following places:
xgawk distribution file contains a copy in the
awklib/xml directory.
xgawk has already been installed on your host machine,
a copy of the file should be in the directory of shared source
(which usually resides at a place like /usr/share/awk/
on GNU/Linux machines).
The file getXMLEVENT.awk as it is serves well if you want
to start writing a script from scratch. It already contains an
event-loop in the BEGIN pattern of the script.
Just take the main body of the event-loop (the while
loop) and change those parts that react on incoming events of the
XML event stream.
But in the remainder of this section, we will assume that we already have a script and we intend to port it. Attempting to describe the approach in the most useful way, we will go through two typical use-cases of the getXMLEVENT.awk template file. First we look at the necessary steps for taking an existing script written for XMLgawk and making it portable for use on Solaris machines (to name just the worst case scenario). Secondly, we go the other way round: take an existing portable script and describe the necessary steps for converting it into an XMLgawk script.
The general approach in porting a script that uses XMLgawk features to a portable script is always the same. No matter if we port the original outline script (see fig:ch2_outline.awk) or if we take a non-trivial application like the DTD generator (see Generating a DTD from a sample file). Now we proceed through the following series of steps.
XMLSTARTELEM action into the event loop.
END pattern of fig:dtd_generator.awk
verbatim after the event loop.
awk -f dtdgport.awk docbook_chapter.xml
It is amazing how simple and effective it is to turn an XMLgawk script into a portable script. After all, you should never forget about the limitations of the portable script. This tiny little XML parser is far from being a complete XML parser. Most notably, it misses the ability to read files with multi-byte characters and other Unicode encoding details. Experience tells us that sooner or later your tiny little parser will stumble across a customer- supplied XML file with special characters in it (copyright marks, typographic dashes, european accent characters, or even chinese characters). Then the need arises to port the script back to the full XMLgawk environment with its full XML parsing capability. When you eventually reach this point, continue reading the next subsection and you will find advice on porting your script back to XMLgawk.
Conversion of scripts from the portable subset to full XMLgawk
is even easier. This ease derives from the similarity of the
portable subset's event-loop with the API in
Style B - Reduced set of variables shared by all events
as described in the
XMLgawk Core Language Interface Summary.
The main point in porting is replacing the invocation of
getXMLEVENT with getline. Step through the
following task list and you will soon arrive at an application
that supports all subtleties of the XML data.
@load xml at the top
of the file.
BEGIN pattern, convert the condition in the while
statement of the event-loop.
while (getXMLEVENT(ARGV[1])) {
gets transformed into
while (getline > 0) {
BEGIN pattern with its event-loop unchanged.
getXMLEVENT, unescapeXML,
and closeXMLEVENT.
In How to traverse the tree with gawk, we have concentrated on the tree
structure of the XML file in Figure 1.3.
We found the two patterns XMLSTARTELEM and XMLENDELEM
which help us following the process of tree traversal. In this
chapter we will find out what the other XML-specific patterns
are. All of them will be used in example scripts and their meaning
will be described informally.
One of the advantages of using the XML format for storing data is that there are formalized methods of checking correctness of the data. Whether the data is written by hand or it is generated automatically, it is always advantageous to have tools for finding out if the new data obeys certain rules (is a tag misspelt ? another one missing ? a third one in the wrong place ?).
These mechanisms for checking correctness are applied at
different levels. The lowest level being well-formedness.
The next higher levels of correctness-check are the level of
the DTD (see Generating a DTD from a sample file)
and (even higher, but not required yet by standards)
the Schema. If you have a DTD (or Schema) specification for your
XML file, you can hand it over to a validation tool, which applies
the specification, checks for conformance and tells you the result.
A simple tool for validation against a DTD is
xmllint,
which is part of libxml and therefore installed on
most GNU/Linux systems. Validation against a Schema can be
done with more recent versions of xmllint or with
the xsv
tool.
There are two reasons why validation is currently not
incorporated into the gawk interpreter.
Here is a script for testing well-formedness of XML data.
The real work of checking well-formedness is done by the
XML parser incorporated into gawk. We are only
interested in the result and some details for error
diagnostic and recovery.
@load xml
END {
if (XMLERROR)
printf("XMLERROR '%s' at row %d col %d len %d\n",
XMLERROR, XMLROW, XMLCOL, XMLLEN)
else
print "file is well-formed"
}
As usual, the script starts with switching gawk
into XML mode. We are not interested in the content of
the nodes being traversed, therefore we have no action
to be triggered for a node. Only at the end (when the
XML file is already closed) we look at some variables
reporting success or failure. If the variable
XMLERROR ever contains anything other than 0
or the empty string, there is an error in parsing and
the parser will stop tree traversal at the place where
the error is. An explanatory message is contained in
XMLERROR (whose contents depends on the specific
parser used on this platform). The other variables in
the example contain the line number and the column in
which the XML file is formed badly.
When working with XML files, it is sometimes necessary to gain some oversight over the structure an XML file. Ordinary editors confront us with a view such as in Figure 1.2 and not a pretty tree view such as in Figure 1.3. Software developers are used to reading text files with proper indentation like the one in fig:ch2_dbfile.
book lang='en' id='hello-world'
bookinfo
title
chapter id='introduction'
title
para
sect1 id='about-this-book'
title
para
sect1 id='work-in-progress'
title
para
Figure 3.1: XML data (DocBook file) as a tree with proper indentation
Here, it is a bit
harder to recognize hierarchical dependencies among the
nodes. But proper indentation allows you to oversee files
with more than 100 elements (a purely graphical view of
such large files gets unbearable). Figure 3.1
was inspired by the tool outline that comes with
the Expat XML parser.
The outline tool produces such an indented output
and we will now write a script that imitates this kind
of output.
@load xml
XMLSTARTELEM {
printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
for (i=1; i<=NF; i++)
printf(" %s='%s'", $i, XMLATTR[$i])
print ""
}
Figure 3.2: outline.awk produces a tree-like outline of XML data
The script outline.awk in Figure 3.2 looks very similar to the
other scripts we wrote earlier, especially the script
max_depth.awk, which also traversed nodes and
remembered the depth of the tree while traversing. The
most important differences are in the lines with the
print statements. For the first time, we don't
just check if the XMLSTARTELEM variable contains
a tag name, but we also print the name out, properly indented
with a printf format statement (two blank characters
for each indentation level).
At the end of the description of the max_depth.awk
script in Figure 1.4 we already mentioned the
variable XMLDEPTH, which is used here as a replacement
of the depth variable. As a consequence, bookkeeping
with the depth variable in an action after the
XMLENDELEM is not necessary anymore. Our script has
become shorter and easier to read.
The other new phenomenon in this script is the associative
array XMLATTR. Whenever we enter a markup block
(and XMLSTARTELEM is non-empty), the array XMLATTR
contains all the attributes of the tag. You can find out the
value of an attribute by accessing the array with the attribute's
name as an array index. In a well-formed XML file, all the attribute
names of one tag are distinct, so we can be sure that each attribute
has its own place in the array. The only thing that's left to do is
to iterate over all the entries in the array and print name and value
in a formatted way. Earlier versions of this script really iterated
over the associative array with the for (i in XMLATTR)
loop. Doing so is still an option, but in this case we wanted to
make sure that attributes are printed in exactly the same oder
that is given in the original XML data. The exact order of attribute
names is reproduced in the fields $1 .. $NF. So the
for loop can iterate over the attributes names in the
fields $1 .. $NF and print the attribute values
XMLATTR[$i].
The script we are analyzing in this section produces
exactly the same output as the script in the previous
section. So, what's so different about it that
we need a second one ? It is the programming style which
is employed in solving the problem at hand. The previous
script was written so that the pattern XMLSTARTELEM
is positioned within the pattern.
This is ordinary AWK programming style, but it is not the way
users of other programming languages were brought up with. In
a procedural language, the software developer expects that he
himself determines control flow within a program. He writes
down what has to be done first, second, third and so on.
In the pattern-action model of AWK, the novice software
developer often has the oppressive feeling that
This feeling is characteristic for a whole class of programming environments. Most people would never think of the following programming environments to have something in common, but they have. It is the absence of a static control flow which unites these environments under one roof:
lex and yacc
tools, the main program only invokes a function yyparse()
and the exact control flow depends on the input source which
controls invocation of certain rules.
Within the context of XML, a terminology has been invented which distinguishes the procedural pull style from the event-guided push style. The script in the previous section was an example of a push-style script. Recognizing that most developers don't like their program's control flow to be pushed around, we will now present a script which pulls one item after the other from the XML file and decides what to do next in a more obvious way.
@load xml
BEGIN {
while (getline > 0) {
switch (XMLEVENT) {
case "STARTELEM": {
printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
for (i=1; i<=NF; i++)
printf(" %s='%s'", $i, XMLATTR[$i])
print ""
}
}
}
}
One XML event after the other is pulled out of the data
with the getline command. It's like feeling each grain
of sand pour through your fingers. Users who prefer this style
of reading input will also appreciate another novelty: The variable
XMLEVENT. While the push-style script in
Figure 3.2 used the event-specific variable
XMLSTARTELEM to detect the occurrence of a new XML element,
our pull-style script always looks at the value of the same
universal variable XMLEVENT to detect a new XML element.
We will dwell on a more detailed example in fig:testxml2pgsql.awk.2.
Formally, we have a script that consists of one BEGIN
pattern followed by an action which is always invoked. You
see, this is a corner case of the pattern-action model
which has been reduced so wide that its essence has disappeared.
Instead of the patterns you now see the cases of switch
statement, embedded into a while loop (for reading the
file item-wise).
Obviously, we have explicite conditionals now, instead of the
implicite ones we used formerly. The actions invoked within
the case conditions are the same we have seen in the
push approach.
All of the example scripts we have seen so far have one thing in common: they were only interested in the tree structure of the XML data. None of them treated the words between the tags. When working with files like the one in Figure 1.2, you are sometimes more interested in the words that are embedded in the nodes of Figure 1.3. XML terminology calls these words character data. In the case of a DocBook file one could call these words which are interspersed between the tags the payload of the whole document. Sometimes one is intersted in freeing this payload from all the useless stuff in angle brackets and extract the character data from the file. The structure of the document may be lost, but the bare textual content in ASCII is revealed and ready for importing it into an application software which does not understand XML.
Hello, world
Introduction
This is the introduction. It has two sections
About this book
This is my first DocBook file.
Warning
This is still under construction.
Figure 3.3: Example of some textual data from a DocBook file
You may wonder where the blank lines between the text lines come from. They are part of the XML file; each line break in the XML outside the tags (even the one after the closing angle bracket of a tag) is character data. The script which produces such an output is extremely simple.
@load xml
XMLCHARDATA { printf $0 }
Figure 3.4: extract_characters.awk extracts textual data from an XML file
Each time some character data is parsed, the XMLCHARDATA
pattern is set to 1 and the character data itself is stored into
the variable $0. A bit unusual is the fact that the text
itself is stored into $0 and not in XMLCHARDATA.
When working with text, one often needs the text split into fields
like AWK does it when the interpreter is not in XML mode. With
the words stored in fields $1 ... $NF, we now
have found a way to refer to isolated words again; it would be
easy to extend the script above so that it counts words like the
script wc.awk did.
Most texts are not as simple as Figure 3.3.
Textual data in computers is not limited to 26 characters and some
punctuation marks anymore. On all keyboards we have various kinds
of brackets (<, [ and {) and in Europe we have had things like the
ligature (Æ) or the umlaut (ü) for centuries. Having thousands
of symbols is not a problem in itself, but it became a problem when
software applications started representing these symbols with different
bytes (or even byte sequences). Today we have a standard for
representing all the symbols in the world with a byte sequence –
Unicode.
Unfortunately, the accepted standard came too late. Earlier
standardization efforts had created ways of representing subsets
of the complete symbol set, each subset containing 256 symbols
which could be represented by one byte. These subsets had names
which are still in use today (like ISO-8859-1 or IBM-852 or ISO-2022-JP).
Then came the programming language Java with a char data type
having 16 bits for each character. It turned out that 16 bits were
also not enough to represent all symbols. Having recognized the fixed
16 bit characters as a failure, the standards organizations finally
established the current Unicode standard. Today's Unicode character
set is a wonderful catalog of symbols – the book mentioned above
needs more than a 1000 pages to list them all.
And now to the ugly side of Unicode:
5.2 The Encoding DeclarationEvery XML document should have an encoding declaration as part of its XML declaration. The encoding declaration tells the parser in which character set the document is written. It's used only when other metadata from outside the file is not available. For example, this XML declaration says that the document uses the character encoding US-ASCII:
<?xml version="1.0" encoding="US-ASCII" standalone="yes"?>This one states that the document uses the Latin-1 character set, though it uses the more official name ISO-8859-1:
<?xml version="1.0" encoding="ISO-8859-1"?>Even if metadata is not available, the encoding declaration can be omitted if the document is written in either the UTF-8 or UTF-16 encodings of Unicode. UTF-8 is a strict superset of ASCII, so ASCII files can be legal XML documents without an encoding declaration.
Several times a character set name is assigned to an encoding declaration – the book does it and the XML samples do it too. Only in the last paragraph the usage of terms is clean: UTF-8 is the default way of encoding a character into a byte sequence.
After this unpleasant excursion into the cultural history of
textual data in occidental societies, let's get back to gawk
and see how the concepts of the encoding and the character set
are incorporated into the language.
Three variables are all that you need to know, but each of
them comes from a different context. Take care that you recognize
the difference between the XML document, gawk's internal
data handling and the influence of an environment variable from
the shell environment setting the locale.
XMLATTR["ENCODING"] is a pattern variable that (when non-empty
and XMLDECLARATION is triggered)
contains the name of the character encoding in which the XML file
was originally encoded. This information comes from the first
line of the XML file (if the line contains the usual XML header).
There is no use in overwriting this variable, the variable is
meant to tell you what's in the XML data and nothing happens
when you change XMLATTR["ENCODING"].
XMLCHARSET is the variable to change if you want to see
the XML data converted to a character set of your own choice. When
you set this variable, the gawk interpreter will remember
the character set of your choice. But this choice will take effect
only upon opening the next file. A change of XMLCHARSET
will not influence XML data from a file that has already been
opened earlier for reading.
LANG is an environment variable of your operating system.
It tells the gawk interpreter which value to use for
XMLCHARSET on initial startup when nothing has been said
about it in the user's script. In the absence of any setting for
LANG, US-ASCII is used as the default encoding.
Up to now, we have always talked about the encoding and character
set of the data to be processed. Remember that the source
code of your program is also written in some character set. It is
usually the LANG character set that is used while writing
programs. Imagine what happens when you have a program containing
a character from your native character set, for which there is no
encoding in the character set used at run-time. The alert reader
will notice how consequent gawk is in following the Unicode
tradition of mixing up character encoding and character set.
After so much scholastic reasoning, you might be inclined
to presume that character sets and encodings are hardly of
any use in real life (except for befuddling the novice).
The following example should dispel your doubts.
In real life, circumstance transcending sensible reasoning
could require you to import the text in Figure 3.3
into a Microsoft Windows application. Contemporary flavours of
Microsoft Windows prefer to store textual data in UTF-16.
So, a script for converting the text to UTF-16 would
be a nice tool to have – and you already have such a tool.
The script extract_characters.awk in Figure 3.4
will do the job, if you tell the gawk interpreter to use
the UTF-16 encoding when reading the DocBook file.
Two alternatives ways of reaching this target arise:
XMLCHARSET to UTF-16.
After invocation, the gawk interpreter will now print
the same data as in Figure 3.3, but converted
to UTF-16.
BEGIN { XMLCHARSET="utf-16" }
gawk
interpreter, set the environment variable LANG to UTF-16.
Earlier in this chapter we have seen that gawk does
not validate XML data against a DTD. The declaration of a document
type in the header of an XML file is an optional part of the data,
not a mandatory one. If such a declaration is present (like it is
in Figure 1.2), the reference to the DTD will not be
resolved and its contents will not be parsed. However, the presence
of the declaration will be reported by gawk. When the declaration
starts, the variable XMLSTARTDOCT contains the name of the
root element's tag; and later, when the declaration ends, the variable
XMLENDDOCT is set to 1. In between, the array variable XMLATTR
will be populated with the values of the public identifier
of the DTD (if any) and the value of the system's identifier of the DTD
(if any). Other parts of the declaration (elements, attributes and entities)
will not be reported.
@load xml
XMLDECLARATION {
version = XMLATTR["VERSION" ]
encoding = XMLATTR["ENCODING" ]
standalone = XMLATTR["STANDALONE" ]
}
XMLSTARTDOCT {
root = XMLSTARTDOCT
pub_id = XMLATTR["PUBLIC" ]
sys_id = XMLATTR["SYSTEM" ]
intsubset = XMLATTR["INTERNAL_SUBSET"]
}
XMLENDDOCT {
print FILENAME
print " version '" version "'"
print " encoding '" encoding "'"
print " standalone '" standalone "'"
print " root id '" root "'"
print " public id '" pub_id "'"
print " system id '" sys_id "'"
print " intsubset '" intsubset "'"
print ""
version = encoding = standalone = ""
root = pub_id = sys_id = intsubset ""
}
Figure 3.5: db_version.awk extracts details about the DTD from an XML file
Most users can safely ignore these variables if they are only interested in the data itself. But some users may take advantage of these variables for checking requirements of the XML data. If your data base consists of thousands of XML file of diverse origins, the public identifier of their DTDs will help you gain an oversight over the kind of data you have to handle and over potential version conflicts. The script in Figure 3.5 will assist you in analyzing your data files. It searches for the variables mentioned above and evaluates their content. At the start of the DTD, the tag name of the root element is stored; the identifiers are also stored and finally, those values are printed along with the name of the file which was analyzed. After each DTD, the remembered values are set to an empty string until the DTD of the next file arrives.
In fig:ch2_DTD_details you can see an example output of
the script in Figure 3.5. The first entry is the
file we already know from Figure 1.2. Obviously, the first
entry is a DocBook file (English version 4.2) containing a
book element which has to be validated against a local
copy of the DTD at CERN in Switzerland. The second file is a
chapter element of DocBook (English version 4.1.2) to
be validated against a DTD on the Internet. Finally, the third
entry is a file describing a project of the GanttProject application.
There is only a tag name for the root element specified, a DTD
does not seem to exist.
data/dbfile.xml
version ''
encoding ''
standalone ''
root id 'book'
public id '-//OASIS//DTD DocBook XML V4.2//EN'
system id '/afs/cern.ch/sw/XML/XMLBIN/share/www.oasis-open.org/docbook/xmldtd-4.2/docbookx.dtd'
intsubset ''
data/docbook_chapter.xml
version ''
encoding ''
standalone ''
root id 'chapter'
public id '-//OASIS//DTD DocBook XML V4.1.2//EN'
system id 'http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd'
intsubset ''
data/exampleGantt.gan
version '1.0'
encoding 'UTF-8'
standalone ''
root id 'ganttproject.sourceforge.net'
public id ''
system id ''
intsubset ''
Figure 3.6: Details about the DTDs in some XML files
You may wish to make changes to this script if you need it in
daily work. For example, the script currently reports nothing
for files which have no DTD declaration in them. You can easily
change this by appending an action for the END rule which
reports in case all the variables root, pub_id
and sys_id are empty. As it is, the script parses the
entire XML file, although the DTD is always positioned at the
top, before the root element. Parsing the root element is
unnecessary and you can improve the speed of the script significantly
if you tell it to stop parsing when the first element (the root
element) comes in.
XMLSTARTELEM { nextfile }
If you have read this web page sequentially until now, you have understood how to read an XML file and treat it as a tree. You also know how to handle different character encodings and DTD declarations. This section is meant to give you an overview of what other patterns there are when you work with XML files. The overview is meant to be complete in the sense that you will see the name of every pattern involved and an example of usage. Conceptually, you will not see much new material, this is only about some new variables for passing information from the XML file. Here are the new patterns:
XMLATTR["VERSION"].
The following script is meant to demonstrate all XML patterns
and variables. It can help you while you are debugging other
scripts because this script will show you everything that is
in the XML file and how it is read by gawk.
@load xml
# Set XMLMODE so that the XML parser reads strictly
# compliant XML data. Convert characters to Latin-1.
BEGIN { XMLMODE=1 ; XMLCHARSET = "ISO-8859-1" }
# Print an outline of nested tags and attributes.
XMLSTARTELEM {
printf("%*s%s", 2*XMLDEPTH-2, "", XMLSTARTELEM)
for (i=1; i<=NF; i++)
printf(" %s='%s'", $i, XMLATTR[$i])
print ""
}
# Upon closing tag, XMLPATH still holds the tag name.
XMLENDELEM { printf("%s %s\n", "XMLENDELEM", XMLPATH) }
# XMLEVENT holds the name of the current event.
XMLEVENT { print "XMLEVENT", XMLEVENT, XMLNAME, $0 }
# Character data will not be lost.
XMLCHARDATA { print "XMLCHARDATA", $0 }
# Processing instruction and comments instructions will be reported.
XMLPROCINST { print "XMLPROCINST", XMLPROCINST, $0 }
XMLCOMMENT { print "XMLCOMMENT", $0 }
# CDATA sections are used for quoting verbatim text.
XMLSTARTCDATA { print "XMLSTARTCDATA" }
# CDATA blocks have an end that is reported.
XMLENDCDATA { print "XMLENDCDATA" }
# The very first event holds the version info.
XMLDECLARATION {
version = XMLATTR["VERSION" ]
encoding = XMLATTR["ENCODING" ]
standalone = XMLATTR["STANDALONE"]
}
# DTDs, if present, are indicated as such.
XMLSTARTDOCT {
root = XMLSTARTDOCT
print "XMLATTR[PUBLIC]", XMLATTR["PUBLIC"]
print "XMLATTR[SYSTEM]", XMLATTR["SYSTEM"]
print "XMLATTR[INTERNAL_SUBSET]", XMLATTR["INTERNAL_SUBSET"]
}
# The end of a DTD is also indicated.
XMLENDDOCT { print "root", root }
# Unparsed text occurs rarely.
XMLUNPARSED { print "XMLUNPARSED", $0 }
# XMLENDDOCUMENT occurs only with XML data that is not
# strictly compliant to standards (multiple root elements).
XMLENDDOCUMENT { print "XMLENDDOCUMENT" }
# At the end of the file, you can check if an error occurred.
END { if (XMLERROR)
printf("XMLERROR '%s' at row %d col %d len %d\n",
XMLERROR, XMLROW, XMLCOL, XMLLEN)
}
Figure 3.7: The script demo_pusher.awk demonstrates all variables of XMLgawk
All the variables that were added to the AWK language to allow for reading XML files show you one event at a time. If you want to rearrange data from several nodes, you have to collect the data during tree traversal. One example for this situation is the name of the parent node which is needed several time in the examples of Some Advanced Applications.
Stefan Tramm has written the xmllib library because he wanted
to simplify the use of gawk for command line usage (one-liners).
His library comes as an ordinary script file with AWK code and is
automatically included upon invocation of xmlgawk.
It introduces new variables for easy handling of character data
and tag nesting. Stefan contributed the library as well as the
xmlgawk wrapper script.
FIXME: This chapter has not been written yet.
Even with the xmllib, random access to nodes in the tree is not possible. There are a few applications which need access to parent and child elements and sometimes even remote places in the tree. That's why Manuel Collado wrote the xmltree library.
Manuel's xmltree reads an XML file at once and stores it entirely. This approach is called the DOM approach. Languages like XSL inherently assume that the DOM is present when executing a script. This is, at once, the strength (random access) and the weakness (holding the entire file in memory) of these languages. Manuel contributed the xmltree library.
FIXME: This chapter has not been written yet.
This chapter is a collection of XML related problems which were posted in newsgroups on the Internet. After a citation of the original posting and a short outline of the problem, each of these problems is followed by a solution in XMLgawk. Although we take care to find an exact solution to the original problem, we are not really interested in the details of any of these problems. What we are interested in is a demonstration of how to attack problems of this kind in general. The raison d'être for this chapter is manifold:
The original poster of this problem wanted to find all tags
which have an attribute of a specific kind (i="Y")
and produce the value of another attribute as output. He
described the problem as follows with an input/output
relationship:
suppose i have:
<a>
<b i="Y" j="aaaa"/>
<c i="N" j="bbbb"/>
<d i="Y" j="cccc"/>
<e i="N" j="dddd"/>
<f i="N" j="eeee"/>
<g i="Y" j="ffff"/>
</a>
and i want to extract the elements where i="Y" such that i get something like
<x>
<y>1. aaaa</y>
<y>2. cccc</y>
<y>3. gggg</y>
</x>
how would i get the numbering to work across the different elements?
He probably had XML data from an input form with two fields.
The first field containing the answer to an alternative choice
(Y/N) and the second field containing some description. The goal
was to extract the specific description for all positive answers
(i="Y"). All the output data had to be embedded into nested
tags (x and y). The nesting of the tags explains the
print commands in the BEGIN and the END patterns
of the following solution.
@load xml
BEGIN { print "<x>" }
XMLSTARTELEM {
if (XMLATTR["i"] == "Y")
print " <y>" ++n ". " XMLATTR["j"] "</y>"
}
END { print "</x>" }
An XMLSTARTELEM pattern triggers the printing of the
y output tags. But only if the attribute i has
the value Y will an output be printed. The output itself
consists of the value of the attribute j and is embedded
into y tags.
If you try the script above on the input data supplied by
the original poster, you will notice that the resulting
output differs slightly from the desired output given above.
There is obviously a typo in the third item of the output
(gggg instead of ffff).
Problems of this kind (input data is XML and output dats is
also XML) are usually solved with the XSL language. From this
example we learn that XMLgawk is an adequate tool for reading
the input data. But producing the tagged structure of the output
data (with simple print commands) is not as elegant as
some users like it.
This problem differs from the previous one in the kind of output data to be produced. Here we produce tabbed ASCII output from an XML input file. The original poster of the question had XML data in the XMLTV format. XMLTV is a format for storing your knowledge about which TV program (or TV programme in British English) will be broadcast at which time on which channel. The original poster gives some example data (certainly not in the most readable form).
To help me get my head around XMLGAWK can someone solve the following.
I have a XMLTV data file from which I want to extract certain data and
write to a tab-delimited flat file.
The XMLTV data is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<tv><programme start="20041218204000 +1000" stop="20041218225000
+1000" channel="Network TEN Brisbane"><title>The
Frighteners</title><sub-title/><desc>A psychic private detective, who
consorts with deceased souls, becomes engaged in a mystery as members
of the town community begin dying mysteriously.</desc><rating
system="ABA"><value>M</value></rating><length
units="minutes">130</length><category>Horror</category></programme><programme
start="20041218080000 +1000" stop="20041218083000 +1000"
channel="Network TEN Brisbane"><title>Worst Best
Friends</title><sub-title>Better Than Glen</sub-title><desc>Life's
like that for Roger Thesaurus - two of his best friends are also his
worst enemies!</desc><rating
system="ABA"><value>C</value></rating><length
units="minutes">30</length><category>Children</category></programme></tv>
The flate file needs to be as follows:
channel<tab>programme
start<tab>length<tab>title<tab>description<tab>rating value
So the first record would read:
Network TEN Brisbane<tab>2004-12-18 hh:mm<tab>130<tab>The
Frighteners<tab>A psychic private detective, who consorts with
deceased souls, becomes engaged in a mystery as members of the town
community begin dying mysteriously.<tab>M
So, he wants an ASCII output line for each node of kind programme.
The proper outline of his example input looks like this:
tv
programme channel='Network TEN Brisbane' start='20041218204000 +1000' stop='20041218225000+1000'
title
sub-title
desc
rating system='ABA'
value
length units='minutes'
category
programme channel='Network TEN Brisbane' start='20041218080000 +1000' stop='20041218083000 +1000'
title
sub-title
desc
rating system='ABA'
value
length units='minutes'
category
Printing the desired output is not as easy as in the previous
section. Here, much data is stored as character data in
the nodes and only a few data items are stored as attributes.
In XMLgawk it is much easier to work with attributes than with
character data. This sets XMLgawk apart from XSL, which treats
both kinds of data in a more uniform way.
In the action after the BEGIN pattern we can see
how easy it is to produce tabbed ASCII output (i.e.
separating output fields with TAB characters):
just set the OFS variable to "\t".
Another easy task is to collect the information about
the channel and the start time of a program on TV.
These are stored in the attributes of each programme
node. So, upon entering a programme node, the
attributes are read and their content stored for later
work. Why can't we print the output line immediately
upon entering the node ? Because the other data bits
(length, title and description) follow later in nested
nodes. As a consequence, data collection is completed
only when we are leaving the programme node.
Therefore, the printing of tabbed output happens in the
action after the XMLENDELEM == "programme" pattern.
@load xml
BEGIN { OFS= "\t" }
XMLSTARTELEM == "programme" {
channel = XMLATTR["channel"]
start = XMLATTR["start"]
data = ""
}
XMLCHARDATA { data = $0 }
XMLENDELEM == "desc" { desc = data }
XMLENDELEM == "length" { leng = data }
XMLENDELEM == "title" { title = data }
XMLENDELEM == "value" { value = data }
XMLENDELEM == "programme" {
print channel, substr(start,1,4) "-" substr(start,5,2) "-" substr(start,7,2) " " \
substr(start,9,2) ":" substr(start,11,2), leng, title, desc, value
desc = leng = title = value = ""
}
What's left to do is collecting character data.
Each time we come across some character data, we
store it in a variable data for later retrieval.
At this moment we don't know yet what kind of character
data this is. Only later (when leaving the desc,
length, title or value node) can we
assign the data to its proper destination. This kind of
deferred assignment of character data is typical
for XML parsers following the streaming approach:
they see only one data item at a time and the user has
to take care of storing data bits needed later. XML Transformation
languages like XSL don't suffer from this shortcoming.
In XSL you have random access to all information in the
XML data. It is up to the user to decide if the problem at
hand should be solved with a streaming parser (like XMLgawk)
or with a DOM parser
(like XSL). If you want to use XMLgawk
and still enjoy the comfort of easy handling of character
data, you should use the xmllib (see Some Convenience with the xmllib library)
or the xmltree (see DOM-like access with the xmltree library)
library described elsewhere.
Up to now we have seen examples whose main concern was finding
and re-formatting of XML input data. But sometimes reading and
printing is not enough. The original poster of the following
example needs the shortest value of attribute value in
all month tags. He refers explicitly to a solution in XSL
which he tried, mentioning some typical problem he had with XSL
templates.
I'm trying to find the minimum value of a set of data (see below).
I want to compare the lengths of these attribute values and display
the lowest one.
This would be simple if I could re-assign values to a variable,
but from what I gather I can't do that. How do I keep track of the
lowest value as I loop through? My XSL document only finds the length
of each string and prints it out (for now). I can write a template
that calls itself for recursion, but I don't know how to keep the
minimum value readially available as I go through each loop.
Thanks,
James
XML Document
=============================
<calendar name="americana">
<month value="January"/>
<month value="February"/>
<month value="March"/>
<month value="April"/>
<month value="May"/>
<month value="June"/>
<month value="July"/>
<month value="August"/>
<month value="September"/>
<month value="October"/>
<month value="November"/>
<month value="December"/>
</calendar>
The solution he looks for is the value May.
Simple minds like ours simply go through the list of month
tags from top to bottom, always remembering the shortest value
found up to now. Having finished the list, the remembered value
is the solution. Look at the following script and you will find
that it follows the path of our simple-minded approach.
@load xml
XMLSTARTELEM == "month" {
# Initialize shortest
if (shortest == "")
shortest = XMLATTR["value"]
# Find shortest value
if (length(XMLATTR["value"]) < length(shortest))
shortest = XMLATTR["value"]
}
END { print shortest }
A solution in XSL is not as easy as this. XSL is a functional language, as such being mostly free from programming concepts like the variable. It is one of the strengths of functional languages that they are mostly free from side-effects and global variables containing values are (conceptually speaking) side-effects. Therefore, a solution in XSL employs the use of so-called templates which invoke each other recursively.
Examples like this shed some light on the question why XSL is so different from other languages and therefore harder to learn for most of us. As can be seen from this simple example, the use of recursion is unavoidable in XSL. Even for the simplest of all tasks. As a matter of fact, thinking recursively is not the way most software developers prefer to work in daily life. Ask them. When did you use recursion for the last time in your C or C++ or AWK programs ?
A few months after I wrote Generating a DTD from a sample file, someone posted a request for a similar tool in the newsgroup comp.text.xml.
A few years ago my department defined a DTD for a projected class of
documents. Like the US Constitution, this DTD has details that are
never actually used, so I want to clean it up. Is there any tool that
looks at existing documents and compares with the DTD they use?
[I can think of other possible uses for such a tool, so I thought
someone might have invented it. I have XML Spy but do not see a feature
that would do this.]
What the original poster needs is a tool for reading a DTD and finding out if the sample files actually use all the parts of the DTD. This is not exactly what the DTD generator in Generating a DTD from a sample file does. But it would be a practical solution to let the DTD generator produce a DTD for the sample files and compare the produced DTD with the old original DTD file.
Someone else posted an alternative solution, employing a bunch of tools from the Unix tool set:
I did this as part of a migration from TEI SGML to XML. Basically:
a) run nsgmls over the documents and produce ESIS
b) use awk to extract the element type names
c) sort and uniq them
d) use Perl::SGML to read the DTD and list the element type names
e) sort them
f) caseless join the two lists with -a to spit out the non-matches
If you're not using a Unix-based system, I think Cygwin can run these tools.
Whatever solution you prefer, these tools serve the user well on the most popular platforms available.
Most programming languages today offer some support for reading XML files.
But unlike XMLgawk, most other languages map the XML file to a
tree-like memory-resident data structure. This allows for convenient access of all elements
of the XML file in any desired order; not just sequentially one-at-a-time
like in XMLgawk. One user of such a language came up with a common problem in the
newsgroup comp.text.xml and asked for a solution.
When reading the following XML data, notice the two item elements
containing structurally similar sub-elements. Each item has a
PPrice and a BCQuant sub-element, containg price and quantity
of the items. The user asked
I have an XML like this:
<?xml version="1.0" encoding="UTF-8"?>
<invoice>
<bill>
<BId>20</BId>
<CId>73</CId>
<BDate>2006-01-10</BDate>
<BStatus>0</BStatus>
</bill>
<billitems>
<item>
<PName>Kiwi</PName>
<PPrice>0.900</PPrice>
<PId>1</PId>
<BCQuant>15</BCQuant>
</item>
<item>
<PName>Apfel</PName>
<PPrice>0.500</PPrice>
<PId>3</PId>
<BCQuant>10</BCQuant>
</item>
</billitems>
</invoice>
Now I want to have the sum of /invoice/billitems/item/BCQuant * /invoice/billitems/item/PPrice
(=total price)
His last sentence sums it all up: He wants the total cost over all
items, yielding the summation of the product of
PPrice and BCQuant. He identifies the variables to
be multiplied with paths which resemble file names in a
Unix file system. The notation
/invoice/billitems/item/BCQuant * /invoice/billitems/item/PPrice
is quite a convenient way of addressing variables in an XML document.
Some programming languages allow the user to apply this notation
directly for addressing variables. For users of these languages it
is often hard to adjust their habits to XMLgawk's way of
tackling a problem. In XMLgawk, it is not possible
to use such paths for direct access to variables. But it is possible to
use such paths in AWK patterns for matching the current location in
the XML document. Look at the following solution and you will understand
how to apply paths in XMLgawk. The crucial point to understand
is that there is a predefined variable XMLPATH which always
contains the path of the location which is currently under observation.
The very first line of the solution is the basis of access to the
variables PPrice and BCQuant. Each time some character
data is read, the script deposits its content into an associative array
data with the path name of the variable as the index into
the array. As a consequence, this associative array data maps
the variable name (/invoice/billitems/item/BCQuant) to its value
(15), but only for the short time interval when one XML element
item is being read.
@load xml
XMLCHARDATA { data[XMLPATH] = $0 }
XMLENDELEM == "item" {
sum += data["/invoice/billitems/item/BCQuant"] * \
data["/invoice/billitems/item/PPrice" ]
}
END { printf "Sum = %.3f\n",sum }
The summation takes places each time when the reading of one element
item is completed; when XMLENDELEM == "item". At this
point in time the quantity and the price have definitely been stored
in the array data. After completion of the XML document, the
summation process has ended and the only thing left to do is printing
the result.
This simple technique (mapping a path to a value with data[XMLPATH] = $0)
is the key to later accessing data somewhere else in the tree.
Notice the subtle difference between languages like XSL which store the complete
XML document in a tree (DOM) and XMLgawk. With XMLgawk
only those parts of the tree are stored in precious memory which are
really necessary for random access. The only inconvenience is that the
user has to identify these parts himself and store the data explicitly.
Other languages will do the storage implicitly (without writing any code),
but the user has abandoned control over the granularity of data storage.
After a detailed analysis you might find a serious limitation in this
simple approach. It only works for a character data block inside a
markup block when
there is no other tag inside this markup block. In other words:
Only when the node in the XML tree is a terminal node (a leaf, like
number 3, 5, 6, 8, 9, 11, 12 in Figure 1.3 and Figure 1.2),
will character data be stored in data[XMLPATH] as expected.
If you are also interested in accessing character data of non-terminal
nodes in the XML tree (like number 2, 4, 7, 10), you will need a more
sophisticated approach:
@load xml
XMLSTARTELEM { delete data[XMLPATH] }
XMLCHARDATA { data[XMLPATH] = data[XMLPATH] $0 }
The key difference is that the last line now successively accumulates
character data blocks of each non-terminal node while going through the XML tree.
Only after starting to read another node of the same kind (same tag name,
a sibling) will the accumulated character data be cleared. Clearing
is really necessary, otherwise the character data of all nodes of same kind
and depth would accumulate. This kind of accumulation is undesirable because
we expect character data in one data[XMLPATH] to contain only
the text of one node and not the text of other nodes at the same nesting level.
But you are free to adapt this behavior to your needs, of course.
Unlike the previous chapter, this chapter really provides complete application programs doing non-trivial work. Inspite of the sophisticated nature of the tasks, the source code of some of these applications still fits onto one page. But most of the source code had to be split up into two or three parts.
XML data is traditionally associated with Internet applications because this data looks so similar to the HTML data used on the Internet. But after more than 10 years of use, the XML data format has been found to be useful in many more areas, which have different needs. For example, data measured periodically in remote locations is nowadays often encoded as XML data. It is not only the improved readability of XML data (as opposed to proprietary binary formats of the past), but also the tree structure of tagged XML data that is so pleasant for users. If you need to add one more measured data type to your format, no problem, the measuring device just adds an XML attribute or another tag to the XML data tree. The application software reading the data can safely ignore the additional data type and is still able to read the new data format. So far so good. But then a device sends us data like the one in fig:remote_data.xml.
<?xml version="1.0"?>
<MONITORINGSTATIONS>
<C_CD>ES</C_CD>
<DIST_CD>ES100</DIST_CD>
<NAME>CATALUÑA</NAME>
<MONITORINGSTATION>
<EU_CD>ES_1011503</EU_CD>
<MS_CD>1011503</MS_CD>
<LON>0,67891</LON>
<LAT>40.98765</LAT>
<PARAMETER>Particulate Matter < 10 µm</PARAMETER>
<STATISTIC>Days with maximum value > 100 ppm</STATISTIC>
<VALUE>10</VALUE>
<URL>http://www.some.domain.es?query=1&argument=2</URL>
</MONITORINGSTATION>
</MONITORINGSTATIONS>
Figure 7.1: The file remote_data.xml contains data measured in a remote location
If you skim over the data, you might find three places that look odd:
What we need is a script that rectifies these quirks on the XML data while leaving the tree structure untouched. To be more precise: We need a script that
@include xmlcopy
BEGIN { XMLCHARSET = "ISO-8859-1" }
XMLATTR["VERSION"] { XMLATTR["ENCODING"] = XMLCHARSET }
XMLCHARDATA && XMLPATH ~ /\/LON$/ { gsub(",", ".") }
{ XmlCopy() }
Figure 7.2: The script modify.awk slightly modifies the XML data
All the magic of evaluating the faked encoding name and copying everything is done inside the library script. Just like the getXMLEVENT.awk that was explained in A portable subset of XMLgawk, this library script may be found in one of several places:
xgawk distribution file contains a copy in the
awklib/xml directory.
xgawk has already been installed on your host machine,
a copy of the file should be in the directory of shared source
(which usually resides at a place like /usr/share/awk/
on GNU/Linux machines).
Have a look at your copy of the xmlcopy.awk library script.
Notice that the script contains nothing but the declaration of
the XmlCopy() function. Also notice that the function
gets invoked only after all manipulations on the data have been done.
The result of a successful run can be seen in fig:modify.xml.
Shortly after opening the input file, an XMLEVENT of type
DECLARATION occurs, but there is no XMLATTR["ENCODING"]
variable set because the input data doesn't contain such a declaration.
That's the place where our script in Figure 7.2 comes in and sets
the declaration in the right moment. So, the XmlCopy() function
will happily print an encoding name.
gawk -f modify.awk remote_data.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<MONITORINGSTATIONS>
<C_CD>ES</C_CD>
<DIST_CD>ES100</DIST_CD>
<NAME>CATALUÑA</NAME>
<MONITORINGSTATION>
<EU_CD>ES_1011503</EU_CD>
<MS_CD>1011503</MS_CD>
<LON>0.67891</LON>
<LAT>40.98765</LAT>
<PARAMETER>Particulate Matter < 10 μm</PARAMETER>
<STATISTIC>Days with maximum value > 100 ppm</STATISTIC>
<VALUE>10</VALUE>
<URL>http://www.some.domain.es?query=1&argument=2</URL>
</MONITORINGSTATION>
</MONITORINGSTATIONS>
Figure 7.3: The output data of modify.awk is slightly modified
The Internet is a convenient source of news data. Most of the times we use a browser to read the HTML files that are transported via the HTTP protocol to us. But sometimes there is no browser at hand or we don't want the news data to be visualized immediately. A news ticker displaying news headings in a very small window is an example. For such cases, a special news format has been established, the RSS format. The protocol for transporting the data is still HTTP, but now the content is not HTML anymore, but XML with a simple structure (see fig:rss_inq for an example). The root node in the example tells us that we have received data structured according to version 0.91 of the RSS specification. Node number 2 identifies the news channel to us; augmented by its child nodes 3, 4 and 5, which contain title, link and description of the source. But we are not interested in these; we are interested in the titles of the news items. And those are contained in a sequence of nodes like node 6 (only one of them being depicted here).
What we want as textual output is a short list of news titles –
each of them numbered, titled and linked like in fig:table_inq.
How can we collect the data for each news item while traversing all the nodes
and how do we know when we have finished collecting data of one item
and we are ready to print ? The idea is to wait for the end of an item such as
node 6. Notice that the tree is parsed depth-first, so when leaving
node 6 (when pattern XMLENDELEM == "item" triggers), its child nodes 7 and 8
have already been parsed earlier. The most recent data in nodes 7 and 8
contained the title and the link to be printed.
You may ask How do we access data of a node that has already been traversed earlier ?
The answer is that we store textual data in advance (when XMLCHARDATA triggers).
At that moment we don't know yet if the stored data is title or link, but
when XMLENDELEM == "title" triggers, we know that the data was a title
and we can remember it as such. I know this sounds complicated and you definitely
need some prior experience in AWK or event-based programming to grasp it.
If you are confused by these explanations, you will be delighted
to see that all this mumbling is contained in just 4 lines of code
(inside the while loop). It is a bit surprising that these
4 lines are enough to select all the news items from the tree
and ignore nodes 3, 4 and 5. How do we manage to ignore node
3, 4 and 5 ? Well, actually we don't ignore them. They are
title and link nodes and their content is stored
in the variable data. But the content of nodes 3 and 4
never gets printed because printing happens only when leaving
a node of type item.
1. Playing the waiting game http://www.theinquirer.net/?article=18979
2. Carbon-dating the Internet http://www.theinquirer.net/?article=18978
3. LCD industry walking a margin tightrope http://www.theinquirer.net/?article=18977
4. Just how irritating can you get? http://www.theinquirer.net/?article=18976
5. US to take over the entire Internet http://www.theinquirer.net/?article=18975
6. AMD 90 nano shipments 50% by year end http://www.theinquirer.net/?article=18974
Figure 7.5: These are news titles from an RSS news feed
It turns out that traversing the XML file is the easiest part. Retrieving the file from the Internet is a bit more complicated. It would have been wonderful if the news data from the Internet could have been treated as XML data at the moment it comes pouring in hot off the rumour mill. But unfortunately, the XML data comes with a header, which does not follow the XML rules – it is an HTTP header. Therefore, we first have to swallow the HTTP header, then read all the lines from the news feed as ASCII lines and store them into a temporary file. After closing the temporary file, we can re-open the file as an XML file and traverse the news nodes as described above.
@load xml
BEGIN {
if (ARGC != 3) {
print "get_rss_feed - retrieve RSS news via HTTP 1.0"
print "IN:\n host name and feed as a command-line parameter"
print "OUT:\n the news content on stdout"
print "EXAMPLE:"
print " gawk -f get_rss_feed.awk www.TheInquirer.Net inquirer.rss"
print "JK 2004-10-06"
exit
}
host = ARGV[1]; ARGV[1] = ""
feed = ARGV[2]; ARGV[2] = ""
# Switch off XML mode while reading and storing data.
XMLMODE=0
# When connecting, use port number 80 on host
HttpService = "/inet/tcp/0/" host "/80"
ORS = RS = "\r\n\r\n"
print "GET /" feed " HTTP/1.0" |& HttpService
HttpService |& getline Header
# We need a temporary file for the XML content.
feed_file="feed.rss"
# Make feed_file an empty file.
printf "" > feed_file
# Put each XML line into feed_file.
while ((HttpService |& getline) > 0)
printf "%s", $0 >> feed_file
close(HttpService) # this is optional since connection is empty
close(feed_file) # this is essential since we re-read the file
# Read feed_file (XML) and print a simplified summary (ASCII).
XMLMODE=1
XMLCHARSET="ISO-8859-1"
# While printing, use \n as line separator again.
ORS="\n"
while ((getline < feed_file) > 0) {
if (XMLCHARDATA ) { data = $0 }
if (XMLENDELEM == "title" ) { title = data }
if (XMLENDELEM == "link" ) { link = data }
if (XMLENDELEM == "item" ) { print ++n ".\t" title "\t" link }
}
}
You can find more info about the data coming from RSS news feeds in the fine article What is RSS. Digging deeper into details you will find that there are many similar structural definitions which all call themselves RSS, but have differing content. Our script above was written in such a way to make sure that the script understands all different RSS sources, but this could only be achieved at the expense of leaving details out.
There is another problem with RSS feeds. For example, Yahoo also offers RSS news feeds. But if you use the script above for retrieval, Yahoo will send HTML data and not proper XML data. This happens because the RSS standards were not properly defined and Yahoo's HTTP server does not understand our request for RSS data.
In Reading an RSS news feed we have seen how a simple
service on the Internet can be used. The request to the service
was a single line with the name of the service. Only the response
of the server consisted of XML data. What if the request itself
contains several parameters of various types, possibly containing
textual data with newline characters or foreign symbols ? Classical
services like Yahoo's stock quotes have found a way to pass tons
of parameters by appending the parameters to the GET line
of the HTTP request. Practice has shown that such overly long
GET lines are not only awkward (which we could accept)
but also insufficient when object oriented services are needed.
The need for a clean implementation of object oriented services
was the motivation behind the invention of the
SOAP protocol. Instead of compressing
request parameters into a single line, XML encoded data is used for passing
parameters to SOAP services. SOAP still uses HTTP for transportation,
but the parameters are now transmitted with the POST method of HTTP
(which allows for passing data in the body of the request, unlike the
GET method).
In this section we will write a client for a SOAP service. You can find a very short and formalized description of the Barnes & Noble Price Quote service on the Internet. The user can send the ISBN of a book to this service and it will return him some XML data containing the price of the book. You may argue that this example service needs only one parameter and should therefore be implemented without SOAP and XML. This is true, but the SOAP implementation is good enough to reveal the basic principles of operation. If you are not convinced and would prefer a service which really exploits SOAP's ability to pass structured data along with the request, you should have a look at a list on the Internet, which presents many publicly available SOAP services. I urge you to look this page up, it is really enlightening what you can find there. Anyone interested in the inner working of more complex services should click on the Try it link of an example service. Behind the Try it link is some kind of debugger for SOAP requests, revealing the content of request and response in Pseudocode, raw, tree or interactive XML rendering. I have learned much from this SOAPscope. The author of the Barnes & Noble Price Quote service (Mustafa Basgun) has also written a client for his service. In a fine article on the Internet, the author described how he implemented a GUI-based client interface with the help of Flash MX. From this article, we take the following citation, which explains some more details about what SOAP is:
Simple Object Access Protocol (SOAP) is a lightweight protocol for exchange of information in a decentralized, distributed environment. It is an XML based protocol that consists of three parts: an envelope that defines a framework for describing what is in a message and how to process it, a set of encoding rules for expressing instances of application-defined datatypes, and a convention for representing remote procedure calls and responses.
All of the parts he mentions can be seen in the example in fig:soap_request.
The example shows the rendering (as a degenerated tree) of a SOAP request
in XML format. The root node is the envelop mentioned in the citation.
Details on how to process the XML data (Schema and encoding) are declared
in the attributes of the root node. Node number 3 contains the remote
procedure call getPrice that we will use to retrieve the price of
a book whose ISBN is contained in the character data of node number 4.
Notice that node 4 contains not only the ISBN itself but also declares
the data type xs:string of the parameter passed to the remote procedure
getPrice (being the ISBN).
Before we start coding the SOAP client, we have to find out how the
response to this request will look like. The tree in fig:soap_reply
is the XML data which comes as a response and that we have to traverse when
looking for the price of the book. A proper client would analyze the
tree thoroughly and first watch out for the type of node encountered.
The structure of fig:soap_reply will only be returned if the
request was successful; if the request had failed, we would be confronted
with different kinds of nodes describing the failure. It is one of the
advantages of SOAP that the response is not a static number or string,
but a tree with varying content. Error messages are not simply numbered
in a cryptic way, they come as XML elements with specific tag names and
character data describing the problem. But at the moment, we are only
interested in the successful case of seeing node 4 (tag return),
being of type xsd:float and containing the price as character
data.
The first part of our SOAP client in fig:soap_book_price_quote.awk
looks very similar to the RSS
client. The isbn is passed as a parameter from the command line
while the host name and the service identifier soap are fixed.
Looking at the variable response, you will recognize the tree
from Figure 7.6. Only the isbn is not fixed
but inserted as a variable into the XML data that will later be sent
to the SOAP server.
@load xml
BEGIN {
if (ARGC != 2) {
print "soap_book_price_quote - request price of a book via SOAP"
print "IN:\n ISBN of a book as a command-line parameter"
print "OUT:\n the price of the book on stdout"
print "EXAMPLE:"
print " gawk -f soap_book_price_quote.awk 0596000707"
print "JK 2004-10-17"
exit
}
host = "services.xmethods.net" # The name of the server to contact.
soap = "soap/servlet/rpcrouter" # The identifier of the service.
isbn = ARGV[1]; ARGV[1] = ""
# Switch off XML mode while reading and storing data.
XMLMODE=0
# Build up the SOAP request and integrate "isbn" variable.
request="\
<soap:Envelope xmlns:n='urn:xmethods-BNPriceCheck' \
xmlns:soap='http://schemas.xmlsoap.org/soap/envelope/' \
xmlns:soapenc='http://schemas.xmlsoap.org/soap/encoding/' \
xmlns:xs='http://www.w3.org/2001/XMLSchema' \
xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'> \
<soap:Body> \
<n:getPrice> \
<isbn xsi:type='xs:string'>" isbn "</isbn> \
</n:getPrice> \
</soap:Body> \
</soap:Envelope>"
Figure 7.8: First part of soap_book_price_quote.awk builds up a SOAP request
The second and third part of our SOAP client resemble the RSS client even more. But if you compare both more closely, you will find some interesting differences.
ORS is not used anymore because we handle the
header line differently here.
POST method and
not the GET method.
# When connecting, use port number 80 on host.
HttpService = "/inet/tcp/0/" host "/80"
# Setting RS is necessary for separating header from XML reply.
RS = "\r\n\r\n"
# Send out a SOAP-compliant request. First the header.
print "POST " soap " HTTP/1.0" |& HttpService
print "Host: " host |& HttpService
print "Content-Type: text/xml; charset=\"utf-8\"" |& HttpService
print "Content-Length: " length(request) |& HttpService
# Now separate header from request and then send the request.
print "" |& HttpService
print request |& HttpService
Figure 7.9: Second part of soap_book_price_quote.awk sends the SOAP request
Having sent the request, the only thing left to do is receiving
the response and traversing the XML tree. Just like the RSS client,
the SOAP client stores the XML response in a file temporarily and
opens this file as an XML file. While traversing the XML tree, our
client behaves very simple minded: character data is remembered
and printed as soon as a node with tag name return occurs.
# Receive the reply and save it.
HttpService |& getline Header
# We need a temporary file for the XML content.
soap_file="soap.xml"
# Make soap_file an empty file.
printf "" > soap_file
# Put each XML line into soap_file.
while ((HttpService |& getline) > 0)
printf "%s", $0 >> soap_file
close(HttpService) # this is optional since connection is empty
close(soap_file) # this is essential since we re-read the file
# Read soap_file (XML) and print the price of the book (ASCII).
XMLMODE=1
while ((getline < soap_file) > 0) {
if (XMLCHARDATA ) { price = $0 }
if (XMLENDELEM == "return") { print "The book costs", price, "US$."}
}
}
Figure 7.10: Third and final part of soap_book_price_quote.awk reads the SOAP response
What would happen in case of an error ? Each of the following cases would have to be handled in more detail if we were writing a proper SOAP client.
-1.0
as the prices. This is not a problem for our client, but the user
has to keep in mind to check for a negative price.
return node in the response (and therefore
the client will print nothing). Instead, the response will contain nodes
of type SOAP-ENV:Fault, faultcode, faultstring and
faultactor. This situation is depicted in Figure 7.11.
<h1>Error: 400</h1>
Error unmarshalling envelope: SOAP-ENV:VersionMismatch:
Envelope element must be associated with the
'http://schemas.xmlsoap.org/soap/envelope/' namespace.
I began each of the cases above with the word When because it is
not the question if such a case will ever happen but only when
it will happen. When writing software, it is always essential
to distinguish cases that can happen (by evaluating return codes or by
catching exceptions for example). But when writing software that connects
to a network it is inevitable to do so; otherwise the networking software
would be unreliable. So, most real world clients will be much longer than the
one we wrote in this section. But in many cases it is not really
complicated to handle error conditions. For example, by inserting the following
single line at the end of the while loop's body, you can do a first step
toward proper error handling.
if (XMLENDELEM ~ /^fault/) { print XMLENDELEM ":", price }
When the client should ever receive a response like the one in Figure 7.11, it would print the detected messages contained in nodes 4, 5 and 6.
faultcode: SOAP-ENV:Protocol
faultstring: Content length must be specified.
faultactor: /soap/servlet/rpcrouter
In the previous section we have seen how XML can be used as a data-exchange-format during a database query. Using XML in such database retrievals is quite commonplace today. But the actual storage format for the large databases in the background is usually not XML. Many proprietary solutions for database storage have established their niche markets over the decades. These will certainly not disappear just because a geeky new format like XML emerged out of the blue. As a consequence, the need for conversion of data between established databases and XML arises frequently in and around commercial applications. Fortunately, we have the query language SQL as an accepted standard for expressing the query. But unfortunately, the actual interface (the transport mechanism) for request and delivery of queries and results is not standardized at all. The rest of this section describes how this problem was solved by creating a GNU Awk extension for one specific database: PostgreSQL. All proprietary access mechanisms are encapsulated into an extension and an application script uses the extension. The problem at hand is to read a small contact database file (XML, fig:sample.xml) and write the data into a PostgreSQL database with the help of two GNU Awk extensions:
So it is not surprising that the application script in fig:testxml2pgsql.awk begins
with loading both extensions. GNU Awk extensions like pgsql
are usually implemented so that they make the underlying
C interface
accessible to the writer of the application script.
@load xml
@load pgsql
BEGIN {
# Note: should pass an argument to pg_connect containing PQconnectdb
# options, as discussed here:
# http://www.postgresql.org/docs/8.0/interactive/libpq.html#LIBPQ-CONNECT
# Or the parameters can be set in the environment, as discussed here:
# http://www.postgresql.org/docs/8.0/interactive/libpq-envars.html
# For example, a typical call might be:
# pg_connect("host=pgsql_server dbname=my_database")
if ((dbconn = pg_connect()) == "") {
printf "pg_connect failed: %s\n", ERRNO > "/dev/stderr"
exit 1
}
# these are the columns in the table
ncols = split("name email work cell company address", col)
# create a temporary table
sql = "CREATE TEMPORARY TABLE tmp ("
for (i = 1; i <= ncols; i++) {
if (i > 1)
sql = (sql",")
sql = (sql" "col[i]" varchar")
}
sql = (sql")")
if ((res = pg_exec(dbconn, sql)) !~ /^OK /) {
printf "Cannot create temporary table: %s, ERRNO = %s\n",
res, ERRNO > "/dev/stderr"
exit 1
}
# create a prepared insert statement
sql = ("INSERT INTO tmp ("col[1])
for (i = 2; i <= ncols; i++)
sql = (sql", "col[i])
sql = (sql") VALUES ($1")
for (i = 2; i <= ncols; i++)
sql = (sql", $"i)
sql = (sql")")
if ((insert_statement = pg_prepare(dbconn, sql)) == "") {
printf "pg_prepare(%s) failed: %s\n",sql,ERRNO > "/dev/stderr"
exit 1
}
}
Figure 7.12: First part of testxml2pgsql.awk connects to PostgreSQL
For example,
the function pg_connect is just a wrapper around a C function
with an almost identical name. This transparency is good practice,
but not compulsory. Other GNU Awk extensions may choose
to implement some opacity in the design of the interface.
Unsuccessful connection attempts are reported to the user of the application
script before termination. After a successful connection (with pg_connect),
the script tells PostgreSQL about the structure of the database.
This structure is of course inspired by the format of the contact database
in fig:sample.xml. Each field in the database (name, email, work, cell,
company address) is declared with an SQL statement that is executed by PostgreSQL.
After creation of this table, PostgreSQL is expected to respond with
an OK message, otherwise the attempt to create the table has
to be aborted. Finally, a prepared insert statement tells PostgreSQL
about details (fieldwidths) of the database.
<?xml version="1.0" encoding="utf-8"?>
<contact_database>
<contact>
<name>Joe Smith</name>
<phone type="work">1-212-555-1212</phone>
<phone type="cell">1-917-555-1212</phone>
<email>joe.smith@acme.com</email>
<company>Acme</company>
<address>32 Maple St., New York, NY</address>
</contact>
<contact>
<name>Ellen Jones</name>
<phone type="work">1-310-555-1212</phone>
<email>ellen.jones@widget.com</email>
<company>Widget Inc.</company>
<address>137 Main St., Los Angeles, CA</address>
</contact>
<contact>
<name>Ralph Simpson</name>
<phone type="work">1-312-555-1212</phone>
<phone type="cell">1-773-555-1212</phone>
<company>General Motors</company>
<address>13 Elm St., Chicago, IL</address>
</contact>
</contact_database>
Figure 7.13: The contact database to be stored with PostgreSQL
Now that the structure of the database is known to PostgreSQL, we are ready to read the actual data. Assuming that the script has been stored in a file testxml2pgsql.awk and the XML data in a file sample.xml, we can invoke the application like this:
gawk -f testxml2pgsql.awk < sample.xml
name|email|work|cell|company|address
Joe Smith|joe.smith@acme.com|1-212-555-1212|1-917-555-1212|Acme|32 Maple St., New York, NY
Ellen Jones|ellen.jones@widget.com|1-310-555-1212|<NULL>|Widget Inc.|137 Main St., Los Angeles, CA
Ralph Simpson|<NULL>|1-312-555-1212|1-773-555-1212|General Motors|13 Elm St., Chicago, IL
Notice that the data file is not passed as a parameter to
the interpreter, but the data file is redirected (< sample.xml)
to the standard input of the interpreter. This way of invocation will
hide the file's name from the application script, but still allow the
script to handle incoming XML data conveniently inside the curly braces
of fig:testxml2pgsql.awk.2, following the pattern-action
model of AWK.
You will also recognize that some fields which were empty in the
XML file appear as <NULL> fields in the output of the
script. Obviously, while reading the XML file, the application
script in fig:testxml2pgsql.awk.2 takes care which fields
are filled with data and which are empty in a data record.
{
switch (XMLEVENT) {
case "STARTELEM":
if ("type" in XMLATTR)
item[XMLPATH] = XMLATTR["type"]
else
item[XMLPATH] = XMLNAME
break
case "CHARDATA":
if ($1 != "")
data[item[XMLPATH]] = (data[item[XMLPATH]] $0)
break
case "ENDELEM":
if (XMLNAME == "contact") {
# insert the record into the database
for (i = 1; i <= ncols; i++) {
if (col[i] in data)
param[i] = data[col[i]]
}
if ((res = pg_execprepared(dbconn, insert_statement,
ncols, param)) !~ /^OK /) {
printf "Error -- insert failed: %s, ERRNO = %s\n",
res, ERRNO > "/dev/stderr"
exit 1
}
delete item
delete data
delete param
}
break
}
}
Figure 7.14: Second part of testxml2pgsql.awk transmits data to PostgreSQL
Most scripts you have seen so far follow the pattern-action
model of AWK in the way that is described in
XMLgawk Core Language Interface Summary as Style A.
The script in Figure 7.14 is different in that
it employs Style B. Each XML event that comes in is analyzed
inside a switch statement for its type. When Style A
would have called for an XMLSTARTELEM pattern in front of
an action, Style B looks at the content of XMLEVENT
and switches to the STARTELEM case for an action to be done.
The action itself (collecting data from attribute type or
the tag name) remains the same in both styles. The case of
CHARDATA will remain hard to understand, unless you have a
look at Working with XML paths, where the idea of collecting
data from terminal XML nodes is explained. Remember that most of
the data in Figure 7.13 is stored as character data in
the terminal XML nodes. Whenever a contact node is finished,
it is time to store the collected data into PostgreSQL and the
variables have to be emptied for collecting data of the next contact.
All this collecting and storing data will be repeated until there are
no more XML events coming in. By then, PostgreSQL will contain the
complete database.
END {
if (dbconn != "") {
# let's take a look at what we have accomplished
if ((res = pg_exec(dbconn, "SELECT * FROM tmp")) !~ /^TUPLES /)
printf "Error selecting * from tmp: %s, ERRNO = %s\n",
res, ERRNO > "/dev/stderr"
else {
nf = pg_nfields(res)
for (i = 0; i < nf; i++) {
if (i > 0)
printf "|"
printf "%s", pg_fname(res, i)
}
printf "\n"
nr = pg_ntuples(res)
for (row = 0; row < nr; row++) {
for (i = 0; i < nf; i++) {
if (i > 0)
printf "|"
printf "%s",
(pg_getisnull(res,row,i) ? "<NULL>" : pg_getvalue(res,row,i))
}
printf "\n"
}
}
pg_disconnect(dbconn)
}
}
Figure 7.15: Final part of testxml2pgsql.awk reads back data from PostgreSQL
The third and final part of our application script in
Figure 7.15 will help us verify that everything
has been stored correctly. In order to do so, we will also see
how a PostgreSQL database can be read from within our script.
Whenever you have an END pattern in a script, the following
action will be triggered after all data has been read; irrespective
of the success of initializations done earlier. The situation is
comparable to the try (...) catch sequence in the exception
handling of programming languages of lesser importance. In such an
”exception handler”, there are very few assertions about the state
of variables that you can rely on. Therefore, before using any
variable, you have to check that it is valid. That's what's done
first in Figure 7.15:
Reading the database makes sense only when it was actually opened.
If it was in fact opened, then it makes sense to transmit an
SQL statement to PostgreSQL. After a successful transmission (and
only then) the returned result can be split up into fields. All fields
of all rows are printed, but only for those fields that are non-empty.
While reading AWK and XML Concepts, you might have wondered how
the DocBook file in Figure 1.2 was turned into the drawing
of a tree in Figure 1.3. The drawing was not produced
manually but with a conversion tool – implemented as a gawk script.
The secret in finding a good solution for an imaging problem always is to
find the right tool to employ. At the AT&T Labs, there is a project group
working on GraphViz,
an open source software package for drawing graphs. Graphs are data structures
which are more general than trees, so they include trees as a special case
and the dot tool
can produce nice drawings like the one in Figure 1.3.
Before you go and download the source code of dot, have a look at your
operating system's distribution media – dot comes for free with most
GNU/Linux distributions.
But the question remains, how to turn an XML file into the image of a graph ?
dot only reads
textual descriptions
of graphs and produces Encapsulated
PostScript files, which are suitable for inclusion into documents. These
textual descriptions look like fig:source_dot, which contains
the dot source code for the tree in Figure 1.3.
So the question can be recast as how to convert Figure 1.2 into
fig:source_dot ? After a bit of comparison, you will notice that
fig:source_dot essentially has one struct line for each
node (containing the node's name – the tag of the markup block) and one
struct line for each edge in the tree (containing the number of the
node to which it points). The very first struct1 is a bit different.
struct1 contains the root node of the XML file. In the tree, this
node has no number but it is framed with a bold line, while all the other
nodes are numbered and are not framed in a special way. In the remainder
of this section, we will find out how the script outline_dot.awk
in fig:outline_dot.awk converts an XML file into a graph
description which can be read by the dot tool.
digraph G {
rankdir=LR
node[shape=Mrecord]
struct1[label="<f0>book| lang='en'| id='hello-world' "];
struct1 [style=bold];
struct2[label="<f0>bookinfo "];
struct1 -> struct2:f0 [headlabel="2\n\n"]
struct3[label="<f0>title "];
struct2 -> struct3:f0 [headlabel="3\n\n"]
struct4[label="<f0>chapter| id='introduction' "];
struct1 -> struct4:f0 [headlabel="4\n\n"]
struct5[label="<f0>title "];
struct4 -> struct5:f0 [headlabel="5\n\n"]
struct6[label="<f0>para "];
struct4 -> struct6:f0 [headlabel="6\n\n"]
struct7[label="<f0>sect1| id='about-this-book' "];
struct4 -> struct7:f0 [headlabel="7\n\n"]
struct8[label="<f0>title "];
struct7 -> struct8:f0 [headlabel="8\n\n"]
struct9[label="<f0>para "];
struct7 -> struct9:f0 [headlabel="9\n\n"]
struct10[label="<f0>sect1| id='work-in-progress' "];
struct4 -> struct10:f0 [headlabel="10\n\n"]
struct11[label="<f0>title "];
struct10 -> struct11:f0 [headlabel="11\n\n"]
struct12[label="<f0>para "];
struct10 -> struct12:f0 [headlabel="12\n\n"]
}
Figure 7.16: An example of a tree description for the dot tool
Before delving into the details of fig:outline_dot.awk, step back
for a moment and notice the structural similarity between this gawk
script and the one in Figure 3.2. Both determine the depth
of each node while traversing the tree. In the BEGIN section of
fig:outline_dot.awk, only the three print lines were added,
which produce the three first lines of Figure 7.16. The same
holds for the one print line in the END section of
fig:outline_dot.awk, which only finalizes the textual description of
the tree in Figure 7.16. As a consequence, all the struct
lines in Figure 7.16 are produced while traversing the tree
in the XMLSTARTELEM section of fig:outline_dot.awk.
Each time we come across a node, two things have to be done:
To simplify identification of nodes, the node counter n is incremented.
Then n is appended to the struct and allows us to identify each
node by name. Identifying nodes through the tag name of the markup block is
not possible because tag names are not unique. At this stage we are ready to
insert the node into the drawing by printing a line like this:
struct3[label="<f0>title "];
The label of the node is the right place to insert the tag name of the
markup block (XMLSTARTELEM). If there are attributes in the node, they
are appended to the label after a separator character.
@load xml
BEGIN {
print "digraph G {"
print " rankdir=LR"
print " node[shape=Mrecord]"
}
XMLSTARTELEM {
n ++
name[XMLDEPTH] = "struct" n
printf("%s", " " name[XMLDEPTH] "[label=\"<f0>" XMLSTARTELEM)
for (i in XMLATTR)
printf("| %s='%s'", i, XMLATTR[i])
print " \"];"
if (XMLDEPTH==1)
print " " name[1], "[style=bold];"
else
print " " name[XMLDEPTH-1], "->", name[XMLDEPTH] ":f0 [headlabel=\""n"\\n\\n\"]"
}
END { print "}" }
Figure 7.17: outline_dot.awk turns an XML file into a tree description for the dot tool
Now that we have a name for the node, we can draw an edge from its parent node
to the node itself. The array name always contains the identifiers
of the most recently traversed node of given depth. Since we are traversing the
tree depth-first, we can always be sure that the most recently traversed
node of a lesser depth is a parent node. With this assertion in mind, we can
easily identify the parent by name and print a line from the parent node to the node.
struct2 -> struct3:f0 [headlabel="3\n\n"]
The root node (XMLDEPTH==1) is a special case which is easier to handle.
It has no parent, so no edge has to be drawn, but the root node gets
a special recognition by framing it with a bold line.
Now, store the script into a file and invoke it.
The output of gawk is piped directly into dot.
dot is instructed to store Encapsulated PostScript
output into a file tree.eps.
dot converts this description into a nice graphical rendering in the PostScript format.
gawk -f outline_dot.awk dbfile.xml | dot -Tps2 -o tree.eps
We have already talked about validation in Checking for well-formedness.
There, we have learned that gawk does not validate XML files
against a DTD. So, does this mean we have to ignore the topic at all ?
No, the use of DTDs is so widespread that everyone working with XML files
should at least be able to read them. There are at least two good reason why
we should take DTDs for serious:
In such cases, you wish you had a tool like DTDGenerator – A tool to generate XML DTDs. This is a tool which takes a well-formed XML file and produces a DTD out of it (so that the XML file is valid against the DTD, of course). Let's take Figure 1.2 as an example. This is a DocBook file, which has a well-established DTD named in its header. Imagine the DocBook file was much longer and you had an application which required reading and processing the file. Would you go for the complete DocBook DTD and taylor your application to handle all the details in the DocBook DTD ? Probably not. It is more practical to start with a subset of the DTD which is good enough to describe the file at hand. A DTD Generator will produce a DTD which can serve well as a starting point. The DocBook file in Figure 1.2 for example can be described by the DTD in fig:dbfile.dtd. Unlike most DTDs you will find in the wild, this DTD uses indentation to emphasize the structure of the DTD. Attributes are always listed immediately after their element name and the sub-elements occuring in one element follow immediately below, but indented.
You should take some time and try to understand the relationship
of the elements and attributes listed in fig:dbfile.dtd
and the example file in Figure 1.2. The first line of
fig:dbfile.dtd for example tells you that a book
consists of a sequence of elements, which are either a chapter
or a bookinfo. You can verify this by looking at the drawing
in Figure 1.3. The next two lines tell you that
a book has two mandatory attributes, lang and id.
The rest of fig:dbfile.dtd is indented and describes
all other elements and their attributes in the same way. Elements that
have no other elements included in them have #PCDATA in them.
<!ELEMENT book ( chapter | bookinfo )* >
<!ATTLIST book lang CDATA #REQUIRED>
<!ATTLIST book id CDATA #REQUIRED>
<!ELEMENT chapter ( sect1 | para | title )* >
<!ATTLIST chapter id CDATA #REQUIRED>
<!ELEMENT sect1 ( para | title )* >
<!ATTLIST sect1 id CDATA #REQUIRED>
<!ELEMENT para ( #PCDATA ) >
<!ELEMENT title ( #PCDATA ) >
<!ELEMENT bookinfo ( title )* >
Figure 7.18: Example of a DTD, arranged to emphasize nesting structure
The rest of this section consists of the description of the
script which produced the DTD in Figure 7.18. The first
part of the scripts looks rather similar to Figure 7.17.
Both scripts traverse the tree of nodes in the XML file and accumulate
information in a very similar way (the array name and the variable
XMLDEPTH for example). Three additional array variables in
fig:dtd_generator.awk are responsible for storing the information
needed for generating a DTD later.
elem[e] learns all element names and counts how often each element occurs.
This is necessary for knowing their names and determining if a certain element
occured always with a certain attribute.
child[ep,ec] learns which element is the child of which other element.
This is necessary for generating the details of the <!ELEMENT ...>
lines in Figure 7.18.
attr[e,a] learns which element has which attributes.
This is necessary for generating the details of the <!ATTLIST ...>
lines in Figure 7.18.
@load xml
# Remember each element.
XMLSTARTELEM {
# Remember the parent names of each child node.
name[XMLDEPTH] = XMLSTARTELEM
if (XMLDEPTH>1)
child[name[XMLDEPTH-1], XMLSTARTELEM] ++
# Count how often the element occurs.
elem[XMLSTARTELEM] ++
# Remember all the attributes with the element.
for (a in XMLATTR)
attr[XMLSTARTELEM,a] ++
}
END { print_elem(1, name[1]) } # name[1] is the root
Figure 7.19: First part of dtd_generator.awk — collecting information
Having completed its traversal of the tree and knowing all names
of elements and attributes and also their nesting structure, the
action of the END pattern only invokes a function which
starts resolving the relationships of elements and attributes and
prints them in the form of a proper DTD. Notice that name[1]
contains the name of the root node of the tree. This means that the
description of the DTD begins with the top level element of the XML
file (as can be seen in the first line of Figure 7.18).
# Print one element (including sub-elements) but only once.
function print_elem(depth, element, c, atn, chl, n, i, myChildren) {
if (already_printed[element]++)
return
indent=sprintf("%*s", 2*depth-2, "")
myChildren=""
for (c in child) {
split(c, chl, SUBSEP)
if (element == chl[1]) {
if (myChildren=="")
myChildren = chl[2]
else
myChildren = myChildren " | " chl[2]
}
}
# If an element has no child nodes, declare it as such.
if (myChildren=="")
print indent "<!ELEMENT", element , "( #PCDATA ) >"
else
print indent "<!ELEMENT", element , "(", myChildren, ")* >"
# After the element name itself, list its attributes.
for (a in attr) {
split(a, atn, SUBSEP)
# Treat only those attributes that belong to the current element.
if (element == atn[1]) {
# If an attribute occured each time with its element, notice this.
if (attr[element, atn[2]] == elem[element])
print indent "<!ATTLIST", element, atn[2], "CDATA #REQUIRED>"
else
print indent "<!ATTLIST", element, atn[2], "CDATA #IMPLIED>"
}
}
# Now go through the child nodes of this elements and print them.
gsub(/[\|]/, " ", myChildren)
n=split(myChildren, chl)
for(i=1; i<=n; i++) {
print_elem(depth+1, chl[i])
split(myChildren, chl)
}
}
Figure 7.20: Second part of dtd_generator.awk — printing the DTD
The first thing this function does is to decide whether the element
to be printed has already been printed (if so, don't print it twice).
Proper indentation is done by starting each printed line with a
number of spaces (twice as much as the indentation levels).
Next comes the collection of all child nodes of the current element
into the string myChildren. AWK's split function is
used for breaking up the tuple of elements (parent and child) that
make up an associative array index. Having found all children, we
are ready to print the <!ELEMENT ... > line for this element
of the DTD. If an element has no children, then it is a leaf of the
tree and it is marked as such in the DTD. Otherwise all the children
found are printed as belonging to the element.
Finding the right <!ATTLIST ... > line is coded in a similar way.
Each attribute is checked if it has ever occured with the element
and if so, it is printed. The distinction between an attribute that
occurs always with the element and an attribute that occurs sometimes
with the element is the first stage of refinement in this generator.
But if you analyze the generated DTD a bit, you will notice that
the DTD is a rather coarse and liberal DTD.
#PCDATA.
CDATA.
Using the XSD Inference Utility http://msdn2.microsoft.com/en-us/library/aa302302.aspx
Feel free to refine this generator according to your needs.
Perhaps, you can even generate a Schema file along the lines
of Microsoft's XSD Inference Utility, see
Using the XSD Inference Utility.
The rest of the
function print_elem() should be good enough for further
extensions. It takes the child nodes of the element (which
were collected earlier) and uses the function recursively
in order to print each of the children.
It happens rather seldom, but sometimes we have to write a program
which reads an XML file tag by tag and looks very carefully at the
context of a tag and the character data embedded in it. Such programs
detect the sequence, indentation and context of the tags and evaluate
all this in an application specific manner, almost like a compiler or
an interpreter does. These programs are called parsers. Their creation
is not trivial and if you ever have to write a parser, you will be
grateful to find a way of producing the first step of a parser
automatically from an example file. Quite naturally, some commercial
tools exist which promise to generate a parser for you. For example,
the XMLBooster XMLBooster product
generates not only
a parser (in any language in any of the languages C, C++, C#, COBOL,
Delphi, Java or Ada) but also convenient structural documentation
and even a GUI for editing your specific XML files. The XMLBooster uses
an existing DTD or Schema file to generate all these things. Unlike
the XMLBooster, we will not assume that any DTD or Schema file exists
for given XML data. We want our parser generator to take specific
XML data as input and produce a parser for such data.
In the previous section
Generating a DTD from a sample file we already saw how an XML
file was analyzed and a different file was generated, which contained
the syntactical relationship between different kinds of tags. As we
will see later, a parser can be created in a very similar way. So, in
this section we will change the program from the previous
section, leaving everything unchanged, except for the
function print_elem().
Once more, let's take Figure 1.2 (the DocBook file) as an example.
A parser for DocBook files of this kind could begin like the program in
fig:parser_dbfile. In the BEGIN part of the parser,
the very first tag is read by a function NextElement() which we
will see later. If this very first tag is a book tag, then
parsing will go on in a function named after the tag. Otherwise,
the parser will assume that the root tag of the XML file was not
the one expected and the parser terminates with an error message.
In the function parse_book we see a loop, reading one tag after
the other until the closing book tag is read. In between,
each subsequent tag is checked against the set of allowed tags and another
function for handling that tag is invoked. Unexpected tag names lead to
a warning message being emitted, but not to the termination of the parser.
The most important principle in this parser is that
for each tag name, one function exists for parsing tags of its kind.
These functions invoke each other while parsing the XML file (perhaps
recursively, if the XML markup blocks were formed recursively).
Each of these
functions has a header with comments in it, naming the attributes
which come with a tag of this name. Now, look at the parse_book
function and imagine you had to generate such a function. Remember
how we stored information about each kind of tag when we wrote the
DTD generator. You will find that all the information needed about
a tag is already available, we only have to produce a different kind
of output here.
BEGIN {
if (NextElement() == "book") {
parse_book()
} else {
print "could not find root element 'book'"
}
}
function parse_book() {
# The 'book' node has the following attributes:
# Mandatory attribute 'lang'
# Mandatory attribute 'id'
while (NextElement() && XMLENDELEM != "book") {
if (XMLSTARTELEM == "chapter") {
parse_chapter()
} else if (XMLSTARTELEM == "bookinfo") {
parse_bookinfo()
} else {
print "unknown element '" XMLSTARTELEM "' in 'book' line ", XMLROW
}
}
}
Figure 7.21: Beginning of a generated parser for a very simple DocBook file
Now that the guiding principle (recursive descent) is clear, we can turn to the details. The hardest problem in understanding the parser generator will turn out to be the danger of mixing up the kinds of text and data involved. Whenever you turn in circles while trying to understand what's going on, remember the kind of data you are thinking about:
Traditional language parsers read their input text token by token.
The work is divided up between a low-level character reader and
a high-level syntax checker. On the lowest level, a token is singled
out by the scanner, which returns the token to the parser itself.
In a generated parser for XML data, we don't need our own scanner
because the scanner is hidden in the XML reader that we use. What remains
to be generated is a function for reading the next token upon each invocation.
This token-by-token reader in fig:parser_generated_next_element is
implemented in the pull-parser style we have seen earlier.
Notice that the function NextElement() implementing this reader remains
the same in each generated parser. While reading the XML file with
getline, the reader watches for any of the following events in the
token stream:
XMLSTARTELEM is a new tag to be returned.
XMLENDELEM is the end of a markup block to be returned.
XMLCHARDATA is text embedded into a markup block.
XMLERROR is an error indicator, leading to termination.
Text embedded into a markup block is not returned as the function's
return value but is stored into the global variable data.
This function is meant to return the name of a tag — no matter
if it is the beginning or the ending of a markup block. If the caller
wants to distinguish between beginning or ending of a markup block,
he can do so by watching if XMLSTARTELEM or XMLENDELEM
is set. Only when the end of an XML file is reached will an empty
string be returned. It is up to the caller to detect when the end
of the token stream is reached.
@load xml
function NextElement() {
while (getline > 0 && XMLERROR == "" && XMLSTARTELEM == XMLENDELEM)
if (XMLCHARDATA) data = $0
if (XMLERROR) {
print "error in row", XMLROW ", col", XMLCOL ":", XMLERROR
exit
}
return XMLSTARTELEM XMLENDELEM
}
Figure 7.22: The pull-style token reader; identical in all generated parsers
All the code you have seen in this section up to here was generated
code. It makes no sense to copy this code into your own programs. What
follows now is the generator itself. As mentioned earlier, the
generator is identical to the dtd_generator.awk of the previous
section — you only have to replace the function print_elem()
with the version you see in fig:parser_generator1 and fig:parser_generator2.
The beginning of the function print_elem() is easy to understand —
it generates the function NextElement() as you have seen the function
in Figure 7.22. We only need NextElement()
generated once, so we generate it only when the root tag (depth == 1) is handled.
Just like NextElement(), we also need the BEGIN pattern of
Figure 7.21 only once, so it is generated immediately after
NextElement(). What follows is the generation
of the comments about XML attributes as you have seen them in Figure 7.21.
This coding style should not be new to you if you have studied the dtd_generator.awk.
Notice that each invocation of the print_elem() for non-root tags
(depth > 1) produces one function (which is named after the tag).
function print_elem(depth, element, c, atn, chl, n, i, myChildren) {
if (depth==1) {
print "@load xml"
print "function NextElement() {"
print " while (getline > 0 && XMLERROR == \"\" && XMLSTARTELEM == XMLENDELEM)"
print " if (XMLCHARDATA) data = $0"
print " if (XMLERROR) {"
print " print \"error in row\", XMLROW \", col\", XMLCOL \":\", XMLERROR"
print " exit"
print " }"
print " return XMLSTARTELEM XMLENDELEM"
print "}\n"
print "BEGIN {"
print " if (NextElement() == \"" element "\") {"
print " parse_" element "()"
print " } else {"
print " print \"could not find root element '" element "'\""
print " }"
print "}\n"
}
if (already_printed[element]++)
return
print "function parse_" element "() {"
print " # The '" element "' node has the following attributes:"
# After the element name itself, list its attributes.
for (a in attr) {
split(a, atn, SUBSEP)
# Treat only those attributes that belong to the current element.
if (element == atn[1]) {
# If an attribute occured each time with its element, notice this.
if (attr[element, atn[2]] == elem[element])
print indent " # Mandatory attribute '" atn[2] "'"
else
print indent " # Optional attribute '" atn[2] "'"
}
}
print ""
Figure 7.23: The first part of print_elem() in parser_generator.awk
This was the first part of print_elem(). The second part in
fig:parser_generator2 produces the body of the function
(see function parse_book() in Figure 7.21
for a generated example). In the body of the newly generated function
we have a while loop which reads tokens until the currently
read markup block ends with a closing tag. Meanwhile each embedded
markup block will be detected and completely read by another function.
Tags of embedded markup blocks will only be accepted when they belong
to a set of expected tags. The rest of the function should not be new
to you, it descends recursively deeper into the tree of embedded markup
blocks and generates one function for each kind of tag.
print " while (NextElement() && XMLENDELEM != \"" element "\") {"
myChildren=""
for (c in child) {
split(c, chl, SUBSEP)
if (element == chl[1]) {
if (myChildren=="")
myChildren = chl[2]
else
myChildren = myChildren " | " chl[2]
print " if (XMLSTARTELEM == \"" chl[2] "\") {"
print " parse_" chl[2] "()"
printf " } else "
}
}
if (myChildren != "") {
print " {"
printf " print \"unknown element '\" XMLSTARTELEM \"'"
print " in '" element "' line \", XMLROW\n }"
print " }"
} else {
# If an element has no child nodes, declare it as such.
print " # This node is a leaf."
print " }"
print " # The character data is now in \"data\"."
}
print "}\n"
# Now go through the child nodes of this elements and print them.
gsub(/[\|]/, " ", myChildren)
n=split(myChildren, chl)
for(i=1; i<=n; i++) {
print_elem(depth+1, chl[i])
split(myChildren, chl)
}
}
Figure 7.24: The second part of print_elem() in parser_generator.awk
When the complete parser is generated from the example file, you have a commented parser that serves well as a starting point for further refinements. Most importantly, you will add code for evaluation of the XML attributes and for printing results. Although this looks like an easy start into the parsing business, you should be aware of some limitations of this approach:
The previous two sections about generating text files from an XML example file were rather abstract and might have confused you. This section will be different. Here, we will put the program parser_generator.awk to work and see what it's good for. We will generate a parser for the kind of XML output that Microsoft's Excel application produces. Our starting point will be an XML file that we have retrieved from the Internet.
Before we put the parser generator to work, let's repeat once more that the parser generator consists of the source code presented in Figure 7.19, Figure 7.23 and Figure 7.24. Put these three fragments into a file named parser_generator.awk.
Now is the time to look for an XML file produced by Microsoft Excel that will be used by the generator. The example file should contain all relevant structural elements and attributes. Only these will be recognized by the generated parser later. On the Internet I looked for an example file that contained as much valid elements and attributes as possible. I found several file which are freely available and could serve well as templates, but none of them contained all kinds of elements and attributes. Two of the most complete were the following ones. Invoke these commannds and you will find the two files in your current working directory:
wget http://csislabs.palomar.edu/Student/csis120/Matthews/StudentDataFiles/Excel/PastaMidwest.xml
wget http://csislabs.palomar.edu/Student/csis120/Matthews/StudentDataFiles/Excel/Oklahoma2004.xml
If you have some examples of your own, pass their names to the parser generator along with the others like this:
gawk -f parser_generator.awk PastaMidwest.xml Oklahoma2004.xml > ms_excel_parser.awk
Now you will find a new file ms_excel_parser.awk in your current working directory. This is the recursive descent parser, ready to parse and recognize all elements that were present in the template files above. To prove the point, we let the new parser work on the template files and check if these obey the rules:
gawk -f ms_excel_parser.awk PastaMidwest.xml
gawk -f ms_excel_parser.awk Oklahoma2004.xml
gawk -f ms_excel_parser.awk xmltv.xml
could not find root element 'Workbook'
Obviously, the file xmltv.xml from Convert XMLTV file to tabbed ASCII was the only file that did not obey the rules, which is not surprising. Each XML file exported from Microsoft Excel has a node of type Workbook as its root node. These Workbook nodes are parsed by the program ms_excel_parser.awk right at the beginning in the following function:
BEGIN {
if (NextElement() == "Workbook") {
parse_Workbook()
} else {
print "could not find root element 'Workbook'"
}
}
function parse_Workbook() {
# The 'Workbook' node has the following attributes:
# Mandatory attribute 'xmlns:html'
# Mandatory attribute 'xmlns:x'
# Mandatory attribute 'xmlns'
# Mandatory attribute 'xmlns:o'
# Mandatory attribute 'xmlns:ss'
while (NextElement() && XMLENDELEM != "Workbook") {
if (XMLSTARTELEM == "Styles") {
parse_Styles()
} else if (XMLSTARTELEM == "Worksheet") {
parse_Worksheet()
} else if (XMLSTARTELEM == "ExcelWorkbook") {
parse_ExcelWorkbook()
} else if (XMLSTARTELEM == "OfficeDocumentSettings") {
parse_OfficeDocumentSettings()
} else if (XMLSTARTELEM == "DocumentProperties") {
parse_DocumentProperties()
} else {
print "unknown element '" XMLSTARTELEM "' in 'Workbook' line ", XMLROW
}
}
}
Figure 7.25: A generated code fragment from ms_excel_parser.awk
If the root node is not a node of type Workbook (like it's the case with the file xmltv.xml), then a report about a missing root element is printed. As you can easily see, a Workbook has several mandatory attributes. The generated parser could be extended to also check the presence of these. Furthermore, a Workbook is a sequence of nodes of type Styles, Worksheet, ExcelWorkbook, OfficeDocumentSettings or DocumentProperties.
This chapter is meant to be a reference. It lists features in a precise and comprehensive way without motivating their use. First comes a section listing all builtin variables and environment variables used by the XMLgawk extension. Then comes a section explaining the two different ways that these variables can be used. Finally, we have two sections explaining libraries which were built upon the XMLgawk extension.
This section presents all variables and functions which constitute
the XML extension of GNU Awk. For each variable one XML example fragment
explains which XML code causes the pattern to be set. After this event has
passed, the variable contains the empty string. So you cannot rely on
a variable retaining a value until later, when the same kind of events sets a
different value. Since we are not reading lines (but XML events),
the variable $0 is usually not set to any text value
but to the empty string. Setting $0 is seen as a side effect in
XML mode and mentioned as such in this reference.
XMLDECLARATION: integer indicates begin of document<?xml version="1.0" encoding="UTF-8"?>
If an XML document has a header (containing the XML declaration),
then the header will always precede all other kind of data —
even comments, character data and processing instructions.
Therefore the XMLDECLARATION (if there is one at all),
will always be the very first event to be read from the file.
When it has occurred, the XMLATTR array will be populated
with the index items VERSION, ENCODING, and
STANDALONE.
# The very first event holds the version info.
XMLDECLARATION {
version = XMLATTR["VERSION" ]
encoding = XMLATTR["ENCODING" ]
standalone = XMLATTR["STANDALONE"]
}
Each of the entries in the XMLATTR array only exists if
the respective item existed in the XML data.
XMLMODE: integer for switching on XML processingThis integer variable will not be changed by the interpreter. Its initial value is 0. The user sets it (to a value other than 0) to indicate that each file opened afterwards will be read as an XML file. Setting the variable to 0 again will cause the interpreter to read subsequent files as ordinary text files again.
It is allowed to have several files opened at the same time,
some of them XML files and others text files. After opening
a file in one mode or the other, it is not possible to go on
reading the same file in the other mode by changing the value of
XMLMODE.
Many users need to read XML files which have multiple root elements.
Such XML files are (strictly speaking) not really well-formed:
Well-formed XML documents have only one root element.
Setting XMLMODE to -1 tells the interpreter to accept
XML documents with more than one root element.
The use of the line ”@load xml” sets XMLMODE to -1 as a side-effect.
The use of the command line option ”-l xml' does the same.
So, most users prefer the latter methods instead of setting XMLMODE
directly. Invoking the GNU Awk interpreter under the name
xgawk or xmlgawk has the same side-effect.
XMLSTARTELEM: string holds tag upon entering element <book id="hello-world" lang="en">
...
</book>
Upon entering a markup block, the XML parser finds a tag
(book in the example) and copies its name into the
string XMLSTARTELEM variable. Whenever this variable
is set, you can take its value and store the value in another
variable, but you cannot access the tag name of the
enclosing (or the included) markup blocks. As a side effect,
the associative array XMLATTR and the variable $0
are filled. The variable $0 holds the names of the attributes
in the order of occurrence in the XML data. Attribute names in
$0 are separated by space characters. The variables $1
through $NF contain the individual attribute names in the
same order.
XMLATTR: array holds attribute names and values <book id="hello-world" lang="en">
...
</book>
This associative is always empty, except when XMLSTARTELEM,
or XMLDECLARATION, or XMLSTARTDOCT is true. In all
these cases, XMLATTR is used for passing the values of
several named attributes (in the widest sense) to the user.
XMLSTARTELEM, the array XMLATTR is filled
with the names of the attributes of the currently parsed element.
Each attribute name is inserted as an index into the array and the
attribute's value as the value of the array at this index. Notice
that the array is also empty when XMLENDELEM is set.
XMLDECLARATION, the array is filled with
variables named VERSION, ENCODING, and STANDALONE,
reflecting the parameters of the XML header. Each of these variables
is optional.
XMLSTARTDOCT, the array is filled with
variables named PUBLIC, SYSTEM, and INTERNAL_SUBSET,
reflecting the parameters of the same name within the DTD. Each of
these variables is optional.
In the example we have XMLSTARTELEM and XMLATTR set to
# Print all attribute names and values.
XMLSTARTELEM = "book"
XMLATTR["id" ] = "hello-world"
XMLATTR["lang"] = "en"
$0 = "id lang"
$1 = "id"
$2 = "lang"
NF = 2
XMLENDELEM: string holds tag upon leaving element <book id="hello-world" lang="en">
<bookinfo>
<title>Hello, world</title>
</bookinfo>
...
</book>
Upon leaving a markup block, the XML parser finds a tag
(book in the example) and copies its name into the
string XMLENDELEM variable. This variable is not
as useless at it may seem at first sight. An action triggered
by the pattern XMLENDELEM is usually the right place to
process the character data (here Hello, world) that was
accumulated inside an XML element (here title). If the
XML element book contains a list of nested elements bookinfo,
then the pattern XMLENDELEM == "book" may trigger an action
that processes the list of bookinfo data, which was
collected while parsing the book element.
The array XMLATTR is empty at this time instant.
XMLCHARDATA: string holds character data <title>Warning</title>
<para>This is still under construction.</para>
Any textual data interspersed between the markup tags is
called character data. Each occurrence of character data
is indicated by setting XMLCHARDATA. The actual data
is passed in $0 and may contain any text that is coded
in the currently used character encoding. There are possibly
0 bytes contained in the data. The length of the data
in bytes may differ from the number of characters reported as
length($0), for example in
Japanesse texts. The character data reported in $0 need
not be byte-by-byte identical to the original XML data (because
of a potentially different encoding). Line breaks are often
contained in character data, like it is the case in the example above.
All consecutive character data in the XML document will be
passed in $0 in one turn. Thus, line breaks may be contained
in $0.
# Collect character data and report it at end of tagged data block.
XMLCHARDATA { data = $0 }
XMLENDELEM == "title" { title = data }
XMLENDELEM == "link" { link = data }
XMLENDELEM == "item" { print "title", title, "contains", item, "and", link }
XMLPROCINST: string holds processing instruction target <? echo ("this is the simplest, an SGML processing instruction\n"); ?>
Processing instructions begin with <? and end with ?>.
The name immediately following the <? is the target.
The rest of the processing instruction is application specific.
The target is passed to the user in XMLPROCINST and the
content of the processing instruction is passed in $0.
# Find out what kind of processing instruction this is.
switch (XMLPROCINST) {
case "PHP": print "PI contains PHP source:", $0 ; break
case "xml-stylesheet": print "PI contains stylesheet source:", $0 ; break
}
XMLCOMMENT: string holds comment<!-- This is a comment -->
Comments in an XML document look the same as in HTML.
Whenever one occurs, the XML parser sets XMLCOMMENT to 1
and passes the comment itself in $0.
# Report comments.
XMLCOMMENT { print "comment:", $0 }
XMLSTARTCDATA: integer indicates begin of CDATA <script type="text/javascript">
<![CDATA[
... unescaped script content may contain any character like < and "...
]]>
</script>
Character data is not allowed to contain a < character
because this character has a special meaning as a tag indicator.
The same is true for four other characters. All five characters
have to be escaped (<) when used in an XML document.
The CDATA section in an XML document is a way to avoid
the need for escaping. A CDATA section starts with
<![CDATA[ and ends with ]]>. Everything inside a
CDATA section is ignored by the XML parser, but the content is
passed to the user.
Upon occurrence of a CDATA section, XMLSTARTCDATA
is set to 1 and $0 holds the content of the
CDATA section. Notice that a CDATA section
cannot contain the string ]]>, therefore, nested CDATA
sections are not allowed.
XMLENDCDATA: integer indicates end of CDATAWhenever the XMLENDCDATA is set, the CDATA section
has ended and the XML parser starts parsing the data as XML
data again. The closing ]]> of the CDATA section
is not passed to the user.
LANG: env variable holds default character encoding
The operating system's environment at run-time of the GNU Awk
interpreter has an environment variable LANG, which is
part of the locale mechanism of the operating system. Its value
determines the character encoding used by the interpreter. This
value is visible to the user as the initial value of the
XMLCHARSET variable.
# Print the character encoding of the user's environment.
BEGIN { print "LANG =", XMLCHARSET }
Sometimes the value of the LANG variable at the shell level
is not copied verbatim into the XMLCHARSET; the operating
system may choose to resolve aliases.
XMLCHARSET: string holds current character set<?xml version="1.0" encoding="x-sjis-cp932"?>
This string is initially set to the current character set
of the interpreter's environment (nl_langinfo(CODESET)).
Although it is initially set by the interpreter, this string
is meant to be set by the user when he needs data to be converted
to a different character encoding. The XML header above, for example,
is delivered in a Japanese encoding, and it may be necessary
to convert the data to UTF-8 for other applications to read it.
# Set the character encoding so that XML data will be converted.
BEGIN { XMLCHARSET = "utf-8" }
Later, when XML files are opened by the interpreter, all XML
data will be converted to the character set whose name was set
by the user in XMLCHARSET. Notice that changes to
XMLCHARSET will not take effect immediately, but only
on the subsequent opening of any file. Such changes will affect
only the file opened with the changed XMLCHARSET and
not files opened prior to the change.
XMLSTARTDOCT: root tag name indicates begin of DTD <?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE greeting [ <!ELEMENT greeting (#PCDATA)> ]>
<greeting>Hello, world!</greeting>
Valid XML data includes a reference to the DTD against which it should
be validated. The XMLSTARTDOCT variable indicates the beginning
of a DTD reference section. Such DTDs are either embedded into
the XML data (like in the example above), or they are actual
references to DTD files (like in the example below). In both cases,
the name of the root tag of the XML data is copied into the
variable XMLSTARTDOCT.
<?xml version='1.0'?>
<!DOCTYPE ListOfNames SYSTEM "Names.dtd">
<ListOfNames lang="English">
The distinction between both cases can be made by looking at
the XMLATTR array. Once more, the XMLATTR array
is used for passing the values of several named attributes
(in the widest sense) to the user. If the array has an index
INTERNAL_SUBSET, then the DTD is embedded into the
XML data. Otherwise, the optional entries PUBLIC
and SYSTEM will report the system identifier or the
public identifier of the referenced DTD.
# Find out if DTD exists, and if it is embedded or external.
XMLSTARTDOCT {
root = XMLSTARTDOCT
if ("INTERNAL_SUBSET" in XMLATTR) {
...
} else {
public_id = XMLATTR["PUBLIC"]
system_id = XMLATTR["SYSTEM"]
}
}
In both cases, the DTD itself is not parsed by the XML parser.
But any embedded DTD text is passed as unparsed data
in the variable XMLUNPARSED.
XMLENDDOCT: integer indicates end of DTDWhenever the variable XMLENDDOCT is set, the DTD section
has ended and the XML parser continues parsing the data as tagged
XML data again. The closing ]> of the DTD section is not passed
to the user.
XMLUNPARSED: string holds unparsed charactersVery few parts of the XML data go unparsed by the XML parser.
Any embedded DTD of an XML document will be detected and reported
as such (by setting XMLSTARTDOCT), but the DTD content
itself is reported as unparsed data in XMLUNPARSED.
XMLERROR: string holds textual error descriptionThis string is always empty. It is only set when the XML parser
finds an error in the XML data. The string XMLERROR
contains a textual description of the error. The contents
of this text is informal and not guaranteed to be the same
on all platforms. Whenever XMLERROR is non-empty,
the variables XMLROW and XMLCOL contain the
location of the error in the XML data.
XMLROW: integer holds current row of parsed itemThis integer always contains the number of the line which
is currently parsed. Initially, it is set to 0. Upon opening
the first line of an XML file, it is set to 1 and incremented
with each line in the XML data. The incremental reading of
lines is done by the XML parser. Therefore, the notion of a
line here has nothing to do with the notion of a record
in AWK. The content of XMLROW does not depend on the setting of
RS.
XMLCOL: integer holds current column of parsed itemThis integer always contains the number of the column in the
current line which is currently parsed. Initially, it is set to 0.
Upon opening the first line of an XML file, it is set to 1 and incremented
with each character in the XML data. At the beginning of each line
it is set to 1. The incremental reading of lines is done by the XML
parser. Therefore, the notion of a line here has nothing to do with
the notion of a record in AWK. The content of XMLCOL does not
depend on the setting of FS.
XMLLEN: integer holds length of parsed itemThis integer always contains the number of bytes of the item
which is currently parsed. Initially, it is set to 0. The number
of bytes refers to bytes in the XML data originally parsed.
It is not the same as the number of characters.
After the optional conversion to the character encoding determined
by XMLCHARSET the length in XMLLEN may also be different
from the converted length of the XML data item.
XMLDEPTH: integer holds nesting depth of elementsThis integer always contains the nesting depth of the element which is currently parsed. Initially, upon opening an XML file, it is set to 0. Upon entering the first element of an XML file, it is set to 1 and incremented with each further element (which has not yet been completed). Upon complete parsing of an element, the variable is decremented.
XMLPATH: string holds nested tags of parsed elementsUpon starting the interpreter and opening an XML file, this string
is empty. With each XMLSTARTELEM, the new tag name is appended
to XMLPATH with a "/" character in front of the new tag name.
The "/" character (encoded according to the XML
file's character encoding) serves as a separator between both.
With each XMLENDELEM, the old tag name (and the leading "/")
is chopped off XMLPATH.
The user may change XMLPATH, but any change to XMLPATH
will be overwritten with the next XML data read.
XMLENDDOCUMENT: integer indicates end of XML dataThis integer is always 0. It is only set when the XML parser finds the end of XML data.
XMLEVENT: string holds name of eventThis string always contains the name of the event that is currently
being processed. Valid names are
DECLARATION
STARTDOCT
ENDDOCT
PROCINST
STARTELEM
ENDELEM
CHARDATA
STARTCDATA
ENDCDATA
COMMENT
UNPARSED
ENDDOCUMENT
. The names are closely related to the variables of the same name,
that have an XML prefix.
Any names coming with this event are passed in XMLNAME,
$0, and XMLATTR. For details about which variable
carries which information, See fig:table_style_b.
XMLNAME: string holds name assigned to XMLEVENTThe variable XMLNAME is used for passing data when the
variable XMLEVENT contains the specific event. For details
about which variable carries which information, See fig:table_style_b.
The builtin variables of the previous section were chosen so that they bear analogy to the XML parser Expat's API. Most builtin variables reflect a ”handler” function of Expat's API. If you have ever worked with Expat, you will feel at home with XMLgawk. The only question you will have is how are parameters passed ? This section answers your question with a tabular overview of variable names and details on parameter passing.
To be precise, there are actually two tables in this chapter.
In the first table, you will find variable names that can stand by itself as patterns in a program, triggering an action that handles the respective kind of event. The first column of the table contains the variable's name and the second column contains the variable's value when triggered. All parameters that you can receive from the XML parser are mentioned in the remaining columns.
| Event variable | Value | $0 | XMLATTR index (when supplied) | XMLATTR value
|
|---|---|---|---|---|
| XMLDECLARATION | 1 | VERSION ENCODING STANDALONE | "1.0" encoding name "yes"/"no"
| |
| XMLSTARTDOCT | root element name | PUBLIC SYSTEM INTERNAL_SUBSET | public Id system Id 1 | |
| XMLENDDOCT | 1
| |||
| XMLPROCINST | PI name | PI content
| ||
| XMLSTARTELEM | element name | Ordered list of attribute names | given name(s) | given value(s)
|
| XMLENDELEM | element name
| |||
| XMLCHARDATA | 1 | text data
| ||
| XMLSTARTCDATA | 1
| |||
| XMLENDCDATA | 1
| |||
| XMLCOMMENT | 1 | comment text
| ||
| XMLUNPARSED | 1 | text data
| ||
| XMLENDDOCUMENT | 1
|
Figure 8.1: Variables for passing XML data in style A
Now for the second table of variables. Some people don't like
to remember all the different names in the table above. They
prefer to remember only a minimum of two variable names. While
the first variable (XMLEVENT) contains the kind of event
that happened (STARTELEM for example), the second one
(XMLNAME) passes details about it (like the name of the
element). All events and their parameters are passed in this manner.
But sometimes there is more than just one parameter to be passed,
then we have to rely on $0 and XMLATTR, just like it
was already described in the first table.
| XMLEVENT variable | XMLNAME value | $0 | XMLATTR index (when supplied) | XMLATTR value
|
|---|---|---|---|---|
| DECLARATION | VERSION ENCODING STANDALONE | "1.0" encoding name "yes"/"no" | ||
| STARTDOCT | root element name | PUBLIC SYSTEM INTERNAL_SUBSET | public Id system Id 1 | |
| ENDDOCT
| ||||
| PROCINST | PI name | PI content
| ||
| STARTELEM | element name | Ordered list of attribute names | given name(s) | given value(s)
|
| ENDELEM | element name
| |||
| CHARDATA | text data
| |||
| STARTCDATA
| ||||
| ENDCDATA
| ||||
| COMMENT | comment text
| |||
| UNPARSED | text data
| |||
| ENDDOCUMENT
|
Figure 8.2: Variables for passing XML data in style B
FIXME: This section has not been written yet.
FIXME: This section has not been written yet.
Here is a commented list of books for those who intend to learn more about XML and AWK.
oawk and nawk).
So there was a need for an implementation with
open source code, support by developers and an appropriate documentation.
The GNU Awk implementation is maintained by Arnold Robbins, who also
serves as the author of EAP3, the third edition of the
GNU Awk manual.
This inexpensive book is published by O'Reilly, but it is also distributed
with the source distribution of GNU Awk. Along with this up-to-date source
of information comes a reference card. Unfortunately, the reference card
is not part of O'Reilly's book. No other publication is so precise about
the subtle differences between the POSIX standard, the original AWK and
GNU Awk.
This section lists the URLs for various items discussed throughout this web page.
The xgawk program adds a few features to the basic gawk capabilities. Here is a quick summary.
AWKLIBPATH environment variable
(with an appropriate default value if none is present in the environment).
And it automatically tries to supply a default suffix appropriate for
shared libraries on the build platform.
AWKPATH, it first tries to open the file without
a suffix, and then tries with .awk appended.)
@include directive in the source code. This is
the same feature provided by the current igawk script. This works the
same way as in igawk (and as in the new -i command-line option). So
the igawk script can be removed and replaced with a symbolic link to
the new gawk binary. This is a little more powerful than igawk @include
because it can automatically add the .awk suffix.
@load directive in the source code to load a
shared library. This does the same thing as the new -l command-line
option.
The installation process of Extensible GNU Awk has also added some new options. Note that installing xgawk will not in any way interfere with a previous installation of gawk: all the installed files have distinct names from the files installed by regular gawk. So both can coexist without disturbing each other.
The functions described here are intended to expose the libpq C API as documented at //http://www.postgresql.org/docs/8.0/interactive/libpq.html. This documentation can be understood only in conjunction with the libpq documentation.
This API can be used by either invoking xgawk with a command-line
argument of -l pgsql or by inserting @load pgsql in
your script.
Optional parameters are enclosed in square brackets ([ ]).
pg_connect([conninfo])PQconnectdb function. On success, a unique,
opaque connection handle is returned. On failure, a null string ("")
is returned, and ERRNO is set.
pg_connectdb([conninfo])pg_connect.
pg_disconnect(conn)PQfinish function with the handle indicated by conn.
The conn handle must have been returned by a previous call to
pg_connect. If the handle is not found, then -1 is returned
and ERRNO is set. On success, 0 is returned and the connection
associated with conn is no longer active.
pg_finish(conn)pg_disconnect.
pg_reset(conn)PQreset function with the handle indicated by conn.
The conn handle must have been returned by a previous call to
pg_connect. If the handle is not found, then -1 is returned
and ERRNO is set. Otherwise, PQreset is called. If
the subsequent value returned by PQstatus is CONNECTION_OK,
then 0 is returned, otherwise -1 is returned and ERRNO is set.
pg_reconnect(conn)pg_reset.
pg_errormessage(conn)PQerrorMessage on the specified connection
and returns the result. If the connection is not found, the return
value is a null string (""), and ERRNO is set.
pg_getresult(conn)"") is returned,
and ERRNO is set. Otherwise, PQgetResult is called on
the given connection. If it returns NULL, then it returns a
NULL string (""). If the PGresult returned is non-NULL,
then the return value depends on the value of PQresultStatus(PQgetResult(conn)) as follows:
PGRES_TUPLES_OKTUPLES <# of rows> <unique identifier>. You can find the
number of rows being returned by extracting the 2nd word of the returned
handle, or by calling the pg_ntuples function. The returned
string handle is mapped to a PGresult pointer for use in
subsequently extracting the returned data.
PGRES_COMMAND_OKPQcmdTuples and then we return a string
in this format: OK <result of PQcmdTuples>.
Since there is no data being returned, we call PQclear automatically.
PGRES_EMPTY_QUERYPGRES_COMMAND_OK.
PGRES_COPY_INCOPY_IN <PQnfields(res)> {BINARY|TEXT}.
Since there is no data being returned, we call PQclear automatically.
The user code may subsequently call pg_putcopydata to transmit
bulk data to the server (and use pg_putcopyend to terminate the
transfer).
PGRES_COPY_OUTCOPY_OUT <PQnfields(res)> {BINARY|TEXT}.
Since there is no data being returned, we call PQclear automatically.
The user code should subsequently call pg_getcopydata until
it returns a NULL string ("").
default (unhandled value)ERROR [BADCONN ]<status>
where BADCONN is included if PQstatus(conn) does
not equal CONNECTION_OK, and the subsequent string is the
result of calling PQresStatus(PQresultStatus(res)). And
we set ERRNO to the string returned by PQresultErrorMessage(res).
Since there is no data being returned, we call PQclear automatically.
pg_clear(res)ERRNO is set. Otherwise, PQclear(res) is called
and 0 is returned.
pg_exec(conn, command)"") is returned,
and ERRNO is set. Otherwise, PQexec is called with
the command string. If PQexec returns NULL, then
the returned value will start with the "ERROR ". If PQstatus
does not return CONNECTION_OK, then the next word in the returned
value will be "BADCONN". Then the result of calling
PQresStatus(PQresultStatus(NULL)) will be appended to the string.
In addition, ERRNO will be set to the string returned
by PQerrorMessage. On the other hand, if PQexec does
not return NULL, then the result will be in the standard format returned
by pg_getresult.
pg_execparams(conn, command, nParams [, paramValues])"") is returned,
and ERRNO is set. Otherwise, PQexecParams is called with the
paramTypes, paramLengths, and paramFormats arguments
set to NULL. The paramValues array is used by searching for the
value corresponding to $n in paramValues[n]. The
return value is the same as for pg_exec.
pg_prepare(conn, command)"") is returned,
and ERRNO is set. Otherwise, PQprepare is called with the
command string. If PQprepare returns NULL, or if
PQresultStatus(result) != PGRES_COMMAND_OK, then the
function returns a NULL string "" and sets ERRNO.
Otherwise, if successful, an opaque statement handle is returned that can
be used with pg_execprepared or pg_sendqueryprepared.
pg_execprepared(conn, stmtName, nParams [, paramValues])pg_execparams, except that
it requires a prepared statement handle as the second argument instead
of an SQL command.
pg_sendquery(conn, command)ERRNO is set. Otherwise, PQsendQuery is called with the
given command, and the result is returned (should be 0 for failure
and 1 for success). If the return code is 0, then ERRNO will be set.
You should call pg_getresult to retrieve the results. You must
call pg_getresult until it returns a NULL string ("").
pg_sendqueryparams(conn, command, nParams [, paramValues])ERRNO is set.
Otherwise, PQsendQueryParams is called with the
paramTypes, paramLengths, and paramFormats arguments
set to NULL, and the result is returned. As in pg_execparams,
the value corresponding
to $n should be in paramValues[n].
If the return code is 0, ERRNO will be set.
pg_sendprepare(conn, command)"") is returned,
and ERRNO is set. Otherwise, PQsendPrepare is called with the
command string. If PQsendPrepare returns 0,
then the function returns a NULL string "" and sets ERRNO.
Otherwise, an opaque statement handle is returned that can
be used with pg_sendqueryprepared or pg_execprepared.
You should call pg_getresult to ascertain whether the command
completed successfully.
pg_sendqueryprepared(conn, stmtName, nParams [, paramValues])pg_sendqueryparams, except that
it requires a prepared statement handle as the second argument instead
of an SQL command.
pg_putcopydata(conn, buffer)ERRNO is set. Otherwise, PQputCopyData is called with the
buffer argument, and its value is returned. If PQputCopyData
returns -1, then ERRNO is set.
pg_putcopyend(conn [, errormsg])ERRNO is set. Otherwise, PQputCopyEnd is called with the
optional errormsg argument if supplied,
and its value is returned. If PQputCopyEnd
returns -1, then ERRNO is set.
pg_getcopydata(conn)"") is returned,
and ERRNO is set. Otherwise, PQgetCopyData is called with the
async argument set to FALSE. If PQgetCopyData returns
-1, then the copy is done, and a NULL string ("") is returned (and
the user should call pg_getresult to obtain the final result status
of the COPY command).
If the return code is -2 indicating an error, then a NULL string ("")
is returned, and ERRNO is set. Otherwise, the retrieved row
is returned.
pg_nfields(res)ERRNO is set. Otherwise, the value of PQnfields(res) is
returned.
pg_ntuples(res)ERRNO is set. Otherwise, the value of PQntuples(res) is
returned.
pg_fname(res, column_number)"")
is returned and ERRNO is set. Otherwise, the value of
PQfname(res, column_number) is returned.
pg_getvalue(res, row_number, column_number)"")
is returned and ERRNO is set. Otherwise, the value of
PQgetvalue(res, row_number, column_number) is returned.
pg_getisnull(res, row_number, column_number)ERRNO is set. Otherwise, the value of
PQgetisnull(res, row_number, column_number) is returned (1 if the
data is NULL, and 0 if it is non-NULL).
pg_fields(res, field_names)ERRNO is set. Otherwise, the number of fields
in the result (i.e. pg_nfields(res)) is returned, and
the array field_names is cleared and populated with the column
names as follows: field_names[col] contains the value returned
by PQfname(res, col).
pg_fieldsbyname(res, field_names)ERRNO is set. Otherwise, the number of fields
in the result (i.e. pg_nfields(res)) is returned, and
the array field_names is cleared and populated with the column
names as follows: field_names[PQfname(res, col)] contains col.
pg_getrow(res, row_number, field)ERRNO is set. Otherwise, the number of non-NULL
fields in the row is returned, and the field array is cleared
and populated as follows: field[col_number] contains
PQgetvalue(res, row_number, col_number) for all non-NULL columns.
pg_getrowbyname(res, row_number, field)ERRNO is set. Otherwise, the number of non-NULL
fields in the row is returned, and the field array is cleared
and populated as follows: field[PQfname(res, col_number)] contains
PQgetvalue(res, row_number, col_number) for all non-NULL columns.
These functions can be used by either invoking xgawk
with a command-line argument of -l time or by
inserting @load time in your script.
gettimeofday()ERRNO.
If the standard gettimeofday function is available on this platform,
then it simply returns the value. Otherwise, if on Windows,
it tries to use GetSystemTimeAsFileTime.
sleep(seconds)ERRNO is set.
Otherwise, the function should return 0 after sleeping
for the indicated amount of time. Implementation details: depending
on platform availability, it tries to use nanosleep or select
to implement the delay.
The functions described here are intended to expose the GD graphics API as documented at http://www.boutell.com/gd/manual2.0.33.html. This documentation can be understood only in conjunction with the GD documentation.
These functions can be used by either invoking xgawk
with a command-line argument of -l gd or by
inserting @load gd in your script.
gdImageCreateFromFile(img_dst, file_name)gdImageCreateFromJpeg(),
gdImageCreateFromPng(), or gdImageCreateFromGif().
It returns an image handle or an empty string if fails.
gdImageCopyResampled(img_dst, img_src, dstX, dstY, srcX, srcY, destW, destH, srcW, srcH)gdImageCreateTrueColor(width, height)gdImageDestroy(img)gdImagePngName(img, file_name)gdImagePng().
It returns 0 if succeeds or -1 if fails.
gdImageStringFT(img, brect, fg, fontname, ptsize, angle, x, y, string)gdImageColorAllocate(img, r, g, b)gdImageFilledRectangle(img, x1, y1, x2, y2, color)gdImageSetAntiAliasedDontBlend(img, color, dont_blend)gdImageSetThickness(img, thickness)gdImageSX(img)gdImageSY(img)gdImageCompare(img1, img2)MPFR is a portable library for arbitrary precision arithmetic on floating-point numbers. This means you can use the MPFR extension to perform calculations on numbers with a precision that is much higher (or lower, if you want) than the usual floating point numbers allow (as defined in the IEEE 754-1985 standard) .
To many users, it is not obvious why they should actually need this special kind of numbers with arbitrary precision. Two by two is four, who needs more ? For most calculations in everyday life (summing up prices or distances, calculating gas prices including VAT), the precision of your pocket calculator and your computer is indeed good enough. But if you go just a little further and evaluate the following polynomial, some doubts are cast on the capabilities of your software (example taken from the PASCAL-XSC Language Reference with Examples, page 188).
awk 'BEGIN {x=665857; y=470832; print x^4 - 4 * y^4 - 4 * y^2 }'
11885568
What is so surprising about this result is that it is wrong.
Not just a little bit, but completely wrong when you compare it
to the exact result, which is 1.0. Even worse, the software
doesn't give you any clue. Rest assured that it is
not AWK's fault. AWK relies on the arithmetic implemented in the
underlying operating system (which will produce the same result
no matter which programming language you use for the calculation).
So, what can MPFR do better about this ? First, the problem has to be recast in a different syntax. The usual arithmetic operators of AWK have to be replaced by the equivalent MPFR operators, making it a bit harder to read the program. The following example uses the MPFR extension to evaluate the same polynomial.
gawk -l mpfr 'BEGIN {x=665857; y=470832; \
print mpfr_sub(mpfr_sub(mpfr_pow(x, 4), mpfr_mul(4, mpfr_pow(y, 4))), 4 * y^2)}'
1.1885568000000000E7
By default, the MPFR extension calculates with the same precision (53 bits in the mantissa) as the usual IEEE 754 compatible operators implemented in the operating system. Thus, the result is the same as above. We see no advantage up to now. So, how can we eventually exploit the arbitrary precision capabilities of MPFR ? We have to tell MPFR to use some more bits in the mantissa of the numbers. In this case, 80 bits are enough.
gawk -l mpfr 'BEGIN {MPFR_PRECISION = 80; x=665857; y=470832; \
print mpfr_sub(mpfr_sub(mpfr_pow(x, 4), mpfr_mul(4, mpfr_pow(y, 4))), 4 * y^2)}'
1.0000000000000000000000000
You can see, when calculating with numbers that have 80 bits
in their mantissa, the result of the whole evaluation is correct (1.0).
Now, look at the program again. Notice the end of the polynomial.
The last term of the polynomial has not been recast in terms of
MPFR operators — the usual operators are still used. This example
demonstrates that you can mix ordinary numbers with the long numbers
returned by MPFR. Mixing ordinary numbers with long numbers is quite
convenient and improves readability of the program. But (from an analytic
point of view), this is bad practice. Ordinary numbers are potentially
less precise and one such term in a polynomial might spoil the complete
evaluation. In the case of the polynomial evaluation above it doesn't
matter (because the term is only quadratic in y, requiring not
as long a mantissa as the quadric terms in x and y).
But in the more general cases (where the variables actually vary
and are not constant), you should do the complete evaluation in terms
of MPFR functions.
Let's summarize: MPFR is a portable library for arbitrary precision arithmetic on floating-point numbers. The precision in bits can be set exactly to any valid value for each variable. The semantics of a calculation in MPFR is specified as follows: Compute the requested operation exactly (with infinite accuracy), and round the result to the precision of the destination variable, with the given rounding mode. The MPFR floating-point functions are intended to be a smooth extension of the IEEE 754-1985 arithmetic.
The internal representation of the numbers is not visible to the user of the MPFR extension. To the user, the numbers appear as strings of varying length. As a general rule, all MPFR functions return the result of the numerical calculation as a string containing a number. Each initialization and each calculation of a variable is controlled by the following global variables:
The remaining sections of this appendix contain a list of all
functions provided by the MPFR extension. Notice that only the
functions listed here can actually be used — some obsolete
legacy functions of old MPFR versions are not supported.
Supported functions can only be used after either invoking
xgawk with a command-line argument of -l mpfr
or by inserting @load mpfr in your script. Optional
parameters are enclosed in square brackets ([ ]).
In the following sections, the functions are presented in groups.
The distinction between groups is based on the
arity (the
signature) of the function's parameters and their return
values.
Nullary functions take no (null) arguments, but they return some useful number. These functions are meant to provide you with the best approximation of a specific constant that is possible under the given circumstance (chosen precision, number base and rounding).
The following functions return the base-e logarithm of 2, the value of Pi, of Euler's constant 0.577... respectively, rounded in the currently set direction MPFR_ROUND.
The following examples will not only show you how to use nullary functions, they will also demonstrate the limitations that are inherent to any implementation of arithmetical operators. It will not surprise you that it is easy to print the famous constant Pi to a thousand binary digits.
gawk -l mpfr 'BEGIN {MPFR_PRECISION=1000; print "pi = " mpfr_const_pi() }'
pi = 3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593344612847564823378678316527120190914564856692346034861045432664821339360726024914127360
You could easily change the example to let it print Pi to a million binary digits (the calculation would take just a few seconds more). But what about working with a precision of only four bits ?
gawk -l mpfr 'BEGIN {MPFR_PRECISION=4; print "pi = " mpfr_const_pi() }'
pi = 3.25
You know that the result 3.25 is wrong, but is it really that wrong ?
What actually is the right value of Pi ?
Is it the one in the previous example ? No, none of them
is really exact. Like many other numbers, Pi has
infinitely many places. Every representation of such a
number in floating-point arithmetic can only be an
approximation (3.25 if you have only four bits in the mantissa
and rounding is done to the nearest number). If any calculation
with floating-point numbers returns an exact result
to you, then you were just in luck. Exact results are
an exception, not the rule.
Unary functions take one argument and return some useful number. These functions are meant to provide you with the best approximation of a specific function that is possible under the given circumstance (chosen precision, number base and rounding). The names of the functions in the following list should explain what is meant. In case of doubt, refer to the documentation of the MPFR library.
Binary functions take two arguments and return a result. To many users, these functions are the most commonly needed functions. Among others, they provide the four elementary arithmetic operations: addition, subtraction, multiplication and division. These functions are meant to provide you with the best approximation of a specific function that is possible under the given circumstance (chosen precision, number base and rounding). The names of the functions in the following list should explain what is meant. In case of doubt, refer to the documentation of the MPFR library.
Predicates are boolean-valued functions. They are indicator-functions, testing for some condition and revealing the presence or absence of the condition. Most importantly, error conditions can be checked by using the following functions. Notice that nullary predicates take no argument. They check for a global condition that is unrelated to any specific number.
Unary predicates are similar to nullary predicates in that they detect the presence or absence of a specific condition. But in unary predicates, this condition is bound to a specific number. The importance of these predicates is often underestimated by beginners. For example, detecting a result that is NaN (not a number) may be important. Another subtle question is the equality of numbers, especially equality to zero. In case of doubt, look up the documentation of the MPFR library.
Binary predicates are the most common indicator functions.
They allow you to detect equality of two arguments.
Notice the slight difference between testing for equality
and comparing two arguments (with mpfr_cmp()).
Also notice that any number can be NaN (not a number)
and comparing to NaN or Inf is dubious.
Conversion between internal and external representation.
The source code of the most recent version of xgawk can always be downloaded from the project's web page at SourceForge. Proceed in the same way as you would do for Arnold's GAWK distribution files.
wget http://switch.dl.sourceforge.net/sourceforge/xmlgawk/xgawk-3.1.6-20080101.tar.gz
tar zxvf xgawk-3.1.6-20080101.tar.gz
cd xgawk-3.1.6-20080101
./configure
make
make check
su
make install
Most users already have Arnold's GAWK installed in the default location (/usr/bin/ of their operating system. You may wonder what happens if you invoke make install. Will the original GAWK be overwritten ? No, xgawk will be installed into the same location, but it will be stored separately, nothing will be overwritten. If you are still suspicious and don't want to install xgawk into the same location as your original GAWK, then proceed like this and xgawk will be installed under /tmp.
./configure --prefix=/tmp/
make install
The manual (the document at hand) can be produced in various formats from the file doc/xmlgawk.texi. Notice that your build environment needs a modern implementation (texinfo-4.9) of texinfo to be able to produce these derived document files. But you can also download these files (most recent version in A4 page size) from the Internet.
cd doc
make postscript
make pdf
make html
wget http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.ps
wget http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.pdf
wget http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.html
xgawk should build on all platforms that also support Arnold's GAWK. The most common problem on all platforms is support for dynamic libraries. If this doesn't work for you, try to build with dynamic libraries switched off and static extensions switched on.
./configure --disable-shared --enable-static-extensions
If this build succeeds, then you have all available extensions built into the executable file of the interpreter.
This is the development platform of Jürgen Kahrs. All recent versions of SuSE or openSUSE are able to build xgawk without any problems.
./configure --disable-gd
The build runs fine with both of these Enterprise platforms. Only the regression testing reveals four test cases that fail. You may choose to ignore this or upgrade to the more recent Enterprise editions of the highly esteemed companies RedHat and Novell.
Stefan Tramm reported a successful build of xgawk on some Apple Macintosh platforms. It is most noticable that he managed to build extensions statically as well as dynamically.
Compiles and passes all(*) tests on Mac OS X 10.5.1 on
Intel Core2 Duo for both 32bit and 64bit executables.
With (32Bit only) and without XML extension and also
statically and dynamically linked.
Still cant get GMP and MPFR stuff to work :-(
SQL extension was not tested.
Mac OS X 10.4.11 with 32bit PPC also OK,
with and without XML extension
statically and dynamically linked.
Stefan Tramm also tried a build on a not-so-recent version of Solaris and somehow managed to succeed.
Solaris 2.8 on an Ultra Sparc IIe was an adventure.
gcc-2.95.3 was not able to compile the sources.
Sun workshop compiler 4.2 was able to compile the sources
with the exception of the generated xml_enc_handler.c
include files, which I compiled manually with gcc.
Finally I had a statically linked xgawk with XML
extension running, which passed all tests :)
Jürgen Kahrs built xgawk on the SPARC versions of Solaris 8 and Solaris 10. The build went without problems. Only the regression testing needed a small fix. Change line 1720 in the file test/Makefile. Remove the dash symbol ("-") in front of the include. After this, the regression tests run smoothly and report no problems.
We recommend using xgawk with the Cygwin tool set.
Use the options --enable-static-extensions and --disable-shared to enforce the use of statically linked extension libraries into the executable. On platforms like Cygwin, this option is the only way to work with extensions. These platforms have problems in handling dynamic libraries.
By default, all extensions will be built if the required libraries exist on the platform. If any of the libraries installed on your host cause problems, switch them off with the options --disable-xml, --disable-pgsql, --disable-mpfr, and --disable-gd
Manuel Collado and Victor Paesa both reported successful builds of xgawk with Cygwin. While Manuel succeeded with a more conservative development environment (gcc 3.4.4-1), Victor used a more recent environment and managed to get all extension compiled as described below.
It compiled OK under Cygwin, using:
./configure --disable-shared --enable-static-extensions
All extensions (filefuncs fork ordchr readfile time xml pgsql mpfr gd)
were compiled.
make check passed all tests, except the warning test:
"Bad news: this system does not comply with the spec,
and it is not consistent in its behavior.
It converts the special values to zero in 2
of the 6 cases, and it converts them
properly in the other 4 cases."
Tool versions:
Cygwin 1.5.25-7
binutils 20060817-1
libtool1.5 1.5.23a-2
gcc was 4.2.2 (compiled from source), not the 3.4.4-3 packaged with Cygwin.
Note that after building xgawk, the GAWK.EXE should exist in the shell path (just for the sake of convenience). Thus after installation, it is recommended that copies of GAWK.EXE, libconv2.dll, and libintl3.dll (found in C:\Program Files\GnuWin32\bin in default installations) be placed in the %WINDIR%\system32 directory.
Copyright © 2000,2001,2002 Free Software Foundation, Inc.
59 Temple Place, Suite 330, Boston, MA 02111-1307, USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
The purpose of this License is to make a manual, textbook, or other functional and useful document free in the sense of freedom: to assure everyone the effective freedom to copy and redistribute it, with or without modifying it, either commercially or noncommercially. Secondarily, this License preserves for the author and publisher a way to get credit for their work, while not being considered responsible for modifications made by others.
This License is a kind of “copyleft”, which means that derivative works of the document must themselves be free in the same sense. It complements the GNU General Public License, which is a copyleft license designed for free software.
We have designed this License in order to use it for manuals for free software, because free software needs free documentation: a free program should come with manuals providing the same freedoms that the software does. But this License is not limited to software manuals; it can be used for any textual work, regardless of subject matter or whether it is published as a printed book. We recommend this License principally for works whose purpose is instruction or reference.
This License applies to any manual or other work, in any medium, that contains a notice placed by the copyright holder saying it can be distributed under the terms of this License. Such a notice grants a world-wide, royalty-free license, unlimited in duration, to use that work under the conditions stated herein. The “Document”, below, refers to any such manual or work. Any member of the public is a licensee, and is addressed as “you”. You accept the license if you copy, modify or distribute the work in a way requiring permission under copyright law.
A “Modified Version” of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language.
A “Secondary Section” is a named appendix or a front-matter section of the Document that deals exclusively with the relationship of the publishers or authors of the Document to the Document's overall subject (or to related matters) and contains nothing that could fall directly within that overall subject. (Thus, if the Document is in part a textbook of mathematics, a Secondary Section may not explain any mathematics.) The relationship could be a matter of historical connection with the subject or with related matters, or of legal, commercial, philosophical, ethical or political position regarding them.
The “Invariant Sections” are certain Secondary Sections whose titles are designated, as being those of Invariant Sections, in the notice that says that the Document is released under this License. If a section does not fit the above definition of Secondary then it is not allowed to be designated as Invariant. The Document may contain zero Invariant Sections. If the Document does not identify any Invariant Sections then there are none.
The “Cover Texts” are certain short passages of text that are listed, as Front-Cover Texts or Back-Cover Texts, in the notice that says that the Document is released under this License. A Front-Cover Text may be at most 5 words, and a Back-Cover Text may be at most 25 words.
A “Transparent” copy of the Document means a machine-readable copy, represented in a format whose specification is available to the general public, that is suitable for revising the document straightforwardly with generic text editors or (for images composed of pixels) generic paint programs or (for drawings) some widely available drawing editor, and that is suitable for input to text formatters or for automatic translation to a variety of formats suitable for input to text formatters. A copy made in an otherwise Transparent file format whose markup, or absence of markup, has been arranged to thwart or discourage subsequent modification by readers is not Transparent. An image format is not Transparent if used for any substantial amount of text. A copy that is not “Transparent” is called “Opaque”.
Examples of suitable formats for Transparent copies include plain ascii without markup, Texinfo input format, LaTeX input format, SGML or XML using a publicly available DTD, and standard-conforming simple HTML, PostScript or PDF designed for human modification. Examples of transparent image formats include PNG, XCF and JPG. Opaque formats include proprietary formats that can be read and edited only by proprietary word processors, SGML or XML for which the DTD and/or processing tools are not generally available, and the machine-generated HTML, PostScript or PDF produced by some word processors for output purposes only.
The “Title Page” means, for a printed book, the title page itself, plus such following pages as are needed to hold, legibly, the material this License requires to appear in the title page. For works in formats which do not have any title page as such, “Title Page” means the text near the most prominent appearance of the work's title, preceding the beginning of the body of the text.
A section “Entitled XYZ” means a named subunit of the Document whose title either is precisely XYZ or contains XYZ in parentheses following text that translates XYZ in another language. (Here XYZ stands for a specific section name mentioned below, such as “Acknowledgements”, “Dedications”, “Endorsements”, or “History”.) To “Preserve the Title” of such a section when you modify the Document means that it remains a section “Entitled XYZ” according to this definition.
The Document may include Warranty Disclaimers next to the notice which states that this License applies to the Document. These Warranty Disclaimers are considered to be included by reference in this License, but only as regards disclaiming warranties: any other implication that these Warranty Disclaimers may have is void and has no effect on the meaning of this License.
You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies, and that you add no other conditions whatsoever to those of this License. You may not use technical measures to obstruct or control the reading or further copying of the copies you make or distribute. However, you may accept compensation in exchange for copies. If you distribute a large enough number of copies you must also follow the conditions in section 3.
You may also lend copies, under the same conditions stated above, and you may publicly display copies.
If you publish printed copies (or copies in media that commonly have printed covers) of the Document, numbering more than 100, and the Document's license notice requires Cover Texts, you must enclose the copies in covers that carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on the back cover. Both covers must also clearly and legibly identify you as the publisher of these copies. The front cover must present the full title with all words of the title equally prominent and visible. You may add other material on the covers in addition. Copying with changes limited to the covers, as long as they preserve the title of the Document and satisfy these conditions, can be treated as verbatim copying in other respects.
If the required texts for either cover are too voluminous to fit legibly, you should put the first ones listed (as many as fit reasonably) on the actual cover, and continue the rest onto adjacent pages.
If you publish or distribute Opaque copies of the Document numbering more than 100, you must either include a machine-readable Transparent copy along with each Opaque copy, or state in or with each Opaque copy a computer-network location from which the general network-using public has access to download using public-standard network protocols a complete Transparent copy of the Document, free of added material. If you use the latter option, you must take reasonably prudent steps, when you begin distribution of Opaque copies in quantity, to ensure that this Transparent copy will remain thus accessible at the stated location until at least one year after the last time you distribute an Opaque copy (directly or through your agents or retailers) of that edition to the public.
It is requested, but not required, that you contact the authors of the Document well before redistributing any large number of copies, to give them a chance to provide you with an updated version of the Document.
You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version:
If the Modified Version includes new front-matter sections or appendices that qualify as Secondary Sections and contain no material copied from the Document, you may at your option designate some or all of these sections as invariant. To do this, add their titles to the list of Invariant Sections in the Modified Version's license notice. These titles must be distinct from any other section titles.
You may add a section Entitled “Endorsements”, provided it contains nothing but endorsements of your Modified Version by various parties—for example, statements of peer review or that the text has been approved by an organization as the authoritative definition of a standard.
You may add a passage of up to five words as a Front-Cover Text, and a passage of up to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified Version. Only one passage of Front-Cover Text and one of Back-Cover Text may be added by (or through arrangements made by) any one entity. If the Document already includes a cover text for the same cover, previously added by you or by arrangement made by the same entity you are acting on behalf of, you may not add another; but you may replace the old one, on explicit permission from the previous publisher that added the old one.
The author(s) and publisher(s) of the Document do not by this License give permission to use their names for publicity for or to assert or imply endorsement of any Modified Version.
You may combine the Document with other documents released under this License, under the terms defined in section 4 above for modified versions, provided that you include in the combination all of the Invariant Sections of all of the original documents, unmodified, and list them all as Invariant Sections of your combined work in its license notice, and that you preserve all their Warranty Disclaimers.
The combined work need only contain one copy of this License, and multiple identical Invariant Sections may be replaced with a single copy. If there are multiple Invariant Sections with the same name but different contents, make the title of each such section unique by adding at the end of it, in parentheses, the name of the original author or publisher of that section if known, or else a unique number. Make the same adjustment to the section titles in the list of Invariant Sections in the license notice of the combined work.
In the combination, you must combine any sections Entitled “History” in the various original documents, forming one section Entitled “History”; likewise combine any sections Entitled “Acknowledgements”, and any sections Entitled “Dedications”. You must delete all sections Entitled “Endorsements.”
You may make a collection consisting of the Document and other documents released under this License, and replace the individual copies of this License in the various documents with a single copy that is included in the collection, provided that you follow the rules of this License for verbatim copying of each of the documents in all other respects.
You may extract a single document from such a collection, and distribute it individually under this License, provided you insert a copy of this License into the extracted document, and follow this License in all other respects regarding verbatim copying of that document.
A compilation of the Document or its derivatives with other separate and independent documents or works, in or on a volume of a storage or distribution medium, is called an “aggregate” if the copyright resulting from the compilation is not used to limit the legal rights of the compilation's users beyond what the individual works permit. When the Document is included in an aggregate, this License does not apply to the other works in the aggregate which are not themselves derivative works of the Document.
If the Cover Text requirement of section 3 is applicable to these copies of the Document, then if the Document is less than one half of the entire aggregate, the Document's Cover Texts may be placed on covers that bracket the Document within the aggregate, or the electronic equivalent of covers if the Document is in electronic form. Otherwise they must appear on printed covers that bracket the whole aggregate.
Translation is considered a kind of modification, so you may distribute translations of the Document under the terms of section 4. Replacing Invariant Sections with translations requires special permission from their copyright holders, but you may include translations of some or all Invariant Sections in addition to the original versions of these Invariant Sections. You may include a translation of this License, and all the license notices in the Document, and any Warranty Disclaimers, provided that you also include the original English version of this License and the original versions of those notices and disclaimers. In case of a disagreement between the translation and the original version of this License or a notice or disclaimer, the original version will prevail.
If a section in the Document is Entitled “Acknowledgements”, “Dedications”, or “History”, the requirement (section 4) to Preserve its Title (section 1) will typically require changing the actual title.
You may not copy, modify, sublicense, or distribute the Document except as expressly provided for under this License. Any other attempt to copy, modify, sublicense or distribute the Document is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.
The Free Software Foundation may publish new, revised versions of the GNU Free Documentation License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. See http://www.gnu.org/copyleft/.
Each version of the License is given a distinguishing version number. If the Document specifies that a particular numbered version of this License “or any later version” applies to it, you have the option of following the terms and conditions either of that specified version or of any later version that has been published (not as a draft) by the Free Software Foundation. If the Document does not specify a version number of this License, you may choose any version ever published (not as a draft) by the Free Software Foundation.
To use this License in a document you have written, include a copy of the License in the document and put the following copyright and license notices just after the title page:
Copyright (C) year your name.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
Texts. A copy of the license is included in the section entitled ``GNU
Free Documentation License''.
If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, replace the “with...Texts.” line with this:
with the Invariant Sections being list their titles, with
the Front-Cover Texts being list, and with the Back-Cover Texts
being list.
If you have Invariant Sections without Cover Texts, or some other combination of the three, merge those two alternatives to suit the situation.
If your document contains nontrivial examples of program code, we recommend releasing these examples in parallel under your choice of free software license, such as the GNU General Public License, to permit their use in free software.
CDATA, unescaped Character Data: XML features built into the gawk interpreterchmod: How does XML fit into AWK's execution model ?db_version.awk: Dealing with DTDsdemo_pusher.awk: Sorting out all kinds of data from an XML filedot: Converting XML data into tree drawingsdtd_generator.awk: Generating a DTD from a sample fileeint mpfr extension function: Unary FunctionsExpat, XML parser with SAX-like API: XMLgawk Core Language Interface SummaryExpat, XML parser with SAX-like API: Pulling data out of an XML fileExpat, XML parser with SAX-like API: Printing an outline of an XML fileExpat, XML parser with SAX-like API: Jan Weber's getXML scriptExpat, XML parser with SAX-like API: Prefaceextract_characters.awk: Character data and encoding of character setsgdImageColorAllocate GD graphics extension function: GD Graphics Extension ReferencegdImageCompare GD graphics extension function: GD Graphics Extension ReferencegdImageCopyResampled GD graphics extension function: GD Graphics Extension ReferencegdImageCreateFromFile GD graphics extension function: GD Graphics Extension ReferencegdImageCreateTrueColor GD graphics extension function: GD Graphics Extension ReferencegdImageDestroy GD graphics extension function: GD Graphics Extension ReferencegdImageFilledRectangle GD graphics extension function: GD Graphics Extension ReferencegdImagePngName GD graphics extension function: GD Graphics Extension ReferencegdImageSetAntiAliasedDontBlend GD graphics extension function: GD Graphics Extension ReferencegdImageSetThickness GD graphics extension function: GD Graphics Extension ReferencegdImageStringFT GD graphics extension function: GD Graphics Extension ReferencegdImageSX GD graphics extension function: GD Graphics Extension ReferencegdImageSY GD graphics extension function: GD Graphics Extension Referenceget_rss_feed.awk: Reading an RSS news feedgettimeofday time extension function: Time Extension ReferencegetXML: Jan Weber's getXML scriptgetXMLEVENT.awk: A portable subset of XMLgawkLANG, environment variable: XML features built into the gawk interpreterLANG, environment variable: Character data and encoding of character setsmax_depth.awk: How to traverse the tree with gawkmodify.awk: Copying and Modifying with the <samp><span class="file">xmlcopy.awk</span></samp> library scriptmodify.xml: Copying and Modifying with the <samp><span class="file">xmlcopy.awk</span></samp> library scriptmpfr_abs mpfr extension function: Unary Functionsmpfr_acos mpfr extension function: Unary Functionsmpfr_add mpfr extension function: Binary Functionsmpfr_asin mpfr extension function: Unary Functionsmpfr_atan mpfr extension function: Unary Functionsmpfr_atan2 mpfr extension function: Binary Functionsmpfr_ceil mpfr extension function: Unary Functionsmpfr_cmp mpfr extension function: Binary Functionsmpfr_cmpabs mpfr extension function: Binary Functionsmpfr_const_euler mpfr extension function: Nullary Functionsmpfr_const_log2 mpfr extension function: Nullary Functionsmpfr_const_pi mpfr extension function: Nullary Functionsmpfr_cos mpfr extension function: Unary Functionsmpfr_div mpfr extension function: Binary Functionsmpfr_equal_p mpfr extension function: Binary Predicatesmpfr_erangeflag_p mpfr extension function: Nullary Predicatesmpfr_erf mpfr extension function: Unary Functionsmpfr_erfc mpfr extension function: Unary Functionsmpfr_exp mpfr extension function: Unary Functionsmpfr_exp10 mpfr extension function: Unary Functionsmpfr_exp2 mpfr extension function: Unary Functionsmpfr_floor mpfr extension function: Unary Functionsmpfr_frac mpfr extension function: Unary Functionsmpfr_gamma mpfr extension function: Unary Functionsmpfr_greater_p mpfr extension function: Binary Predicatesmpfr_greaterequal_p mpfr extension function: Binary Predicatesmpfr_hypot mpfr extension function: Binary Functionsmpfr_inexflag_p mpfr extension function: Nullary Predicatesmpfr_inf_p mpfr extension function: Unary Predicatesmpfr_inp_str mpfr extension function: Input and Output of Numbersmpfr_less_p mpfr extension function: Binary Predicatesmpfr_lessequal_p mpfr extension function: Binary Predicatesmpfr_lessgreater_p mpfr extension function: Binary Predicatesmpfr_lngamma mpfr extension function: Unary Functionsmpfr_log mpfr extension function: Unary Functionsmpfr_log10 mpfr extension function: Unary Functionsmpfr_log2 mpfr extension function: Unary Functionsmpfr_max mpfr extension function: Binary Functionsmpfr_min mpfr extension function: Binary Functionsmpfr_mul mpfr extension function: Binary Functionsmpfr_nan_p mpfr extension function: Unary Predicatesmpfr_nanflag_p mpfr extension function: Nullary Predicatesmpfr_neg mpfr extension function: Unary Functionsmpfr_number_p mpfr extension function: Unary Predicatesmpfr_out_str mpfr extension function: Input and Output of Numbersmpfr_overflow_p mpfr extension function: Nullary Predicatesmpfr_pow mpfr extension function: Binary Functionsmpfr_rint mpfr extension function: Unary Functionsmpfr_round mpfr extension function: Unary Functionsmpfr_sgn mpfr extension function: Unary Predicatesmpfr_sin mpfr extension function: Unary Functionsmpfr_sqr mpfr extension function: Unary Functionsmpfr_sqrt mpfr extension function: Unary Functionsmpfr_sub mpfr extension function: Binary Functionsmpfr_tan mpfr extension function: Unary Functionsmpfr_trunc mpfr extension function: Unary Functionsmpfr_underflow_p mpfr extension function: Nullary Predicatesmpfr_unordered_p mpfr extension function: Binary Predicatesmpfr_zero_p mpfr extension function: Unary Predicatesnawk: Good Booksnawk: Jan Weber's getXML scriptnode_count.awk: How to traverse the tree with gawkoutline, Expat application: Printing an outline of an XML fileoutline.awk: Links to the Internetoutline.awk: Printing an outline of an XML fileoutline_dot.awk: Converting XML data into tree drawingsoutline_puller.awk: Pulling data out of an XML filepg_clear pgsql extension function: Command Execution Functionspg_connect pgsql extension function: Database Connection Control Functionspg_connectdb pgsql extension function: Database Connection Control Functionspg_disconnect pgsql extension function: Database Connection Control Functionspg_errormessage pgsql extension function: Connection Status Functionspg_exec pgsql extension function: Command Execution Functionspg_execparams pgsql extension function: Command Execution Functionspg_execprepared pgsql extension function: Command Execution Functionspg_fields pgsql extension function: Higher-level Functions to Retrieve Query Results Using Arrayspg_fieldsbyname pgsql extension function: Higher-level Functions to Retrieve Query Results Using Arrayspg_finish pgsql extension function: Database Connection Control Functionspg_fname pgsql extension function: Retrieving Query Result Informationpg_getcopydata pgsql extension function: Functions for Sending and Receiving COPY Datapg_getisnull pgsql extension function: Retrieving Query Result Informationpg_getresult pgsql extension function: Command Execution Functionspg_getrow pgsql extension function: Higher-level Functions to Retrieve Query Results Using Arrayspg_getrowbyname pgsql extension function: Higher-level Functions to Retrieve Query Results Using Arrayspg_getvalue pgsql extension function: Retrieving Query Result Informationpg_nfields pgsql extension function: Retrieving Query Result Informationpg_ntuples pgsql extension function: Retrieving Query Result Informationpg_prepare pgsql extension function: Command Execution Functionspg_putcopydata pgsql extension function: Functions for Sending and Receiving COPY Datapg_putcopyend pgsql extension function: Functions for Sending and Receiving COPY Datapg_reconnect pgsql extension function: Database Connection Control Functionspg_reset pgsql extension function: Database Connection Control Functionspg_sendprepare pgsql extension function: Asynchronous Command Processingpg_sendquery pgsql extension function: Asynchronous Command Processingpg_sendqueryparams pgsql extension function: Asynchronous Command Processingpg_sendqueryprepared pgsql extension function: Asynchronous Command ProcessingPostgreSQL: PostgreSQL API ReferencePostgreSQL: Loading XML data into PostgreSQLPQclear libpq function: Command Execution FunctionsPQconnectdb libpq function: Database Connection Control FunctionsPQerrorMessage libpq function: Connection Status FunctionsPQexec libpq function: Command Execution FunctionsPQexecParams libpq function: Command Execution FunctionsPQfinish libpq function: Database Connection Control FunctionsPQfname libpq function: Retrieving Query Result InformationPQgetCopyData libpq function: Functions for Sending and Receiving COPY DataPQgetisnull libpq function: Retrieving Query Result InformationPQgetResult libpq function: Command Execution FunctionsPQgetvalue libpq function: Retrieving Query Result InformationPQnfields libpq function: Retrieving Query Result InformationPQntuples libpq function: Retrieving Query Result InformationPQprepare libpq function: Command Execution FunctionsPQputCopyData libpq function: Functions for Sending and Receiving COPY DataPQputCopyEnd libpq function: Functions for Sending and Receiving COPY DataPQreset libpq function: Database Connection Control FunctionsPQsendPrepare libpq function: Asynchronous Command ProcessingPQsendQuery libpq function: Asynchronous Command ProcessingPQsendQueryParams libpq function: Asynchronous Command ProcessingPQsendQueryPrepared libpq function: Asynchronous Command Processingremote_data.xml: Copying and Modifying with the <samp><span class="file">xmlcopy.awk</span></samp> library scriptsample.xml: Loading XML data into PostgreSQLsleep time extension function: Time Extension Referencesoap_book_price_quote.awk: Using a service via SOAPSQL: Loading XML data into PostgreSQLtestxml2pgsql.awk: Loading XML data into PostgreSQLUS-ASCII: Character data and encoding of character setsUTF-16: Character data and encoding of character setsUTF-8: XML features built into the gawk interpreterUTF-8: Character data and encoding of character setsUTF-8: Looking closer at the XML filewc: How does XML fit into AWK's execution model ?wc.awk: How does XML fit into AWK's execution model ?well_formed.awk: Checking for well-formednessXMLATTR: XML features built into the gawk interpreterXMLATTR: Generating a DTD from a sample fileXMLATTR: Converting XML data into tree drawingsXMLATTR: Loading XML data into PostgreSQLXMLATTR: Finding the minimum value of a set of dataXMLATTR: Convert XMLTV file to tabbed ASCIIXMLATTR: Extract the elements where i="Y"XMLATTR: Sorting out all kinds of data from an XML fileXMLATTR: Dealing with DTDsXMLATTR: Pulling data out of an XML fileXMLATTR: Printing an outline of an XML fileXMLATTR["ENCODING"]: XML features built into the gawk interpreterXMLATTR["ENCODING"]: Copying and Modifying with the <samp><span class="file">xmlcopy.awk</span></samp> library scriptXMLATTR["ENCODING"]: Sorting out all kinds of data from an XML fileXMLATTR["ENCODING"]: Dealing with DTDsXMLATTR["ENCODING"]: Character data and encoding of character setsXMLATTR["INTERNAL_SUBSET"]: Sorting out all kinds of data from an XML fileXMLATTR["INTERNAL_SUBSET"]: Dealing with DTDsXMLATTR["PUBLIC"]: XML features built into the gawk interpreterXMLATTR["PUBLIC"]: Sorting out all kinds of data from an XML fileXMLATTR["PUBLIC"]: Dealing with DTDsXMLATTR["STANDALONE"]: XML features built into the gawk interpreterXMLATTR["STANDALONE"]: Sorting out all kinds of data from an XML fileXMLATTR["STANDALONE"]: Dealing with DTDsXMLATTR["SYSTEM"]: XML features built into the gawk interpreterXMLATTR["SYSTEM"]: Sorting out all kinds of data from an XML fileXMLATTR["SYSTEM"]: Dealing with DTDsXMLATTR["VERSION"]: XML features built into the gawk interpreterXMLATTR["VERSION"]: Sorting out all kinds of data from an XML fileXMLATTR["VERSION"]: Dealing with DTDsXMLBooster: Generating a recursive descent parser from a sample fileXMLCHARDATA: XML features built into the gawk interpreterXMLCHARDATA: Sorting out all kinds of data from an XML fileXMLCHARSET: XML features built into the gawk interpreterXMLCHARSET: Sorting out all kinds of data from an XML fileXMLCHARSET: Character data and encoding of character setsXMLCOL: XML features built into the gawk interpreterXMLCOL: Sorting out all kinds of data from an XML fileXMLCOMMENT: XML features built into the gawk interpreterXMLCOMMENT: Sorting out all kinds of data from an XML fileXMLDECLARATION: XML features built into the gawk interpreterXMLDECLARATION: Sorting out all kinds of data from an XML fileXMLDECLARATION: Character data and encoding of character setsXMLDEPTH: XML features built into the gawk interpreterXMLDEPTH: Generating a DTD from a sample fileXMLDEPTH: Converting XML data into tree drawingsXMLDEPTH: Sorting out all kinds of data from an XML fileXMLENDCDATA: XML features built into the gawk interpreterXMLENDCDATA: Sorting out all kinds of data from an XML fileXMLENDDOCT: XML features built into the gawk interpreterXMLENDDOCT: Sorting out all kinds of data from an XML fileXMLENDDOCUMENT: XML features built into the gawk interpreterXMLENDDOCUMENT: Sorting out all kinds of data from an XML fileXMLENDELEM: XML features built into the gawk interpreterXMLENDELEM: Sorting out all kinds of data from an XML fileXMLENDELEM: How to traverse the tree with gawkXMLERROR: XML features built into the gawk interpreterXMLERROR: Sorting out all kinds of data from an XML fileXMLERROR: Checking for well-formednessXMLEVENT: XML features built into the gawk interpreterXMLEVENT: Loading XML data into PostgreSQLXMLEVENT: Sorting out all kinds of data from an XML fileXMLEVENT: Pulling data out of an XML fileXMLLEN: XML features built into the gawk interpreterXMLLEN: Sorting out all kinds of data from an XML filexmllint: Checking for well-formednessXMLMODE: XML features built into the gawk interpreterXMLMODE: Sorting out all kinds of data from an XML fileXMLNAME: XML features built into the gawk interpreterXMLNAME: Sorting out all kinds of data from an XML filexmlparse.awk: Steve Coile's xmlparse.awk scriptXMLPATH: XML features built into the gawk interpreterXMLPATH: Sorting out all kinds of data from an XML fileXMLPROCINST: XML features built into the gawk interpreterXMLPROCINST: Sorting out all kinds of data from an XML fileXMLROW: XML features built into the gawk interpreterXMLROW: Sorting out all kinds of data from an XML fileXMLSTARTCDATA: XML features built into the gawk interpreterXMLSTARTCDATA: Sorting out all kinds of data from an XML fileXMLSTARTDOCT: XML features built into the gawk interpreterXMLSTARTDOCT: Sorting out all kinds of data from an XML fileXMLSTARTELEM: XML features built into the gawk interpreterXMLSTARTELEM: Sorting out all kinds of data from an XML fileXMLSTARTELEM: How to traverse the tree with gawkXMLUNPARSED: XML features built into the gawk interpreterXMLUNPARSED: Sorting out all kinds of data from an XML filexsv: Checking for well-formedness