XML::LibXML
Matt
Sergeant
Christian
Glahn
Petr
Pajas
2.0207
2001-2007
AxKit.com Ltd
2002-2006
Christian Glahn
2006-2009
Petr Pajas
Introduction
README
This module implements a Perl interface to the Gnome
libxml2 library which provides
interfaces for parsing and manipulating XML files. This
module allows Perl programmers to make use of its highly
capable validating XML parser and its high performance DOM
implementation.
Important Notes
XML::LibXML was almost entirely reimplemented between versions 1.40 and 1.49. This may cause problems on some production machines. With
version 1.50 a lot of compatibility fixes were applied, so programs written for XML::LibXML 1.40 or less should run with version 1.50 again.
In 1.59, a new callback API was introduced. This new API is not compatible with the previous one.
See XML::LibXML::InputCallback manual page for details.
In 1.61 the XML::LibXML::XPathContext module, previously distributed separately, was merged in.
Experimental support for Perl threads, introduced in 1.66, was replaced in 1.67.
Dependencies
Prior to installation you MUST have installed the libxml2 library. You can get the latest libxml2 version from
http://xmlsoft.org/
Without libxml2 installed this module will neither build nor run.
Also XML::LibXML requires the following packages:
XML::SAX - base class for SAX parsers
XML::NamespaceSupport - namespace support for SAX parsers
These packages are required. If one is missing some tests will fail.
Again, libxml2 is required to make XML::LibXML work. The library is not just required to build XML::LibXML, it has to be accessible during
run-time as well. Because of this you need to make sure libxml2 is installed properly. To test this, run the xmllint program on your system. xmllint
is shipped with libxml2 and therefore should be available.
For building the module you will also need the header files for libxml2, which in binary
(.rpm, .deb, etc.) distributions usually live in a package named libxml2-devel or similar.
Installation
(These instructions are for UNIX and GNU/Linux systems. For MSWin32,
see Notes for Microsoft Windows below.)
To install XML::LibXML just follow the standard installation routine for Perl modules:
perl Makefile.PL
make
make test
make install # as superuser
Note that XML::LibXML is an XS based Perl extension and you need a C compiler
to build it.
Note also that you should rebuild XML::LibXML if you upgrade libxml2
in order to avoid problems with possible binary incompatibilities between releases of the library.
Notes on libxml2 versions
XML::LibXML requires at least
libxml2 2.6.16 to compile and pass all tests and
at least 2.6.21 is required for XML::LibXML::Reader.
For some older OS versions this means that an
update of the pre-built packages is required.
Although libxml2 claims binary compatibility between
its patch levels, it is a good idea to recompile XML::LibXML
and run its tests after an upgrade of libxml2.
If your libxml2 installation is not within your $PATH,
you can pass the XMLPREFIX=$YOURLIBXMLPREFIX parameter to Makefile.PL
so that it can determine the correct libxml2 version in use, e.g.
perl Makefile.PL XMLPREFIX=/usr/brand-new
will ask '/usr/brand-new/bin/xml2-config' about your real libxml2 configuration.
Try to avoid setting INC and LIBS directly on the
command-line, for if used, Makefile.PL does not check
the libxml2 version for compatibility with XML::LibXML.
Which version of libxml2 should be used?
XML::LibXML is tested against a couple versions of
libxml2 before it is released. Thus there are versions
of libxml2 that are known not to work properly with
XML::LibXML. The Makefile.PL keeps a blacklist of
the incompatible libxml2 versions using Alien::Libxml2.
The blacklist itself is kept inside its "alienfile"
file.
If Makefile.PL detects one of the incompatible versions,
it notifies the user. It may still happen that
XML::LibXML builds and passes its tests with such
a version, but that does not mean everything
is OK. There will be no support at all for blacklisted versions!
As of XML::LibXML 1.61, only versions 2.6.16 and higher are supported.
XML::LibXML will probably not compile with libxml2 versions earlier than
2.5.6. Versions prior to 2.6.8 are known to be broken for various reasons;
versions prior to 2.6.16 exhibit problems with namespaced attributes
and therefore do not pass the XML::LibXML regression tests.
It may happen that an unsupported version of libxml2
passes all tests under certain conditions. This is no
reason to assume that it shall work without problems.
If Makefile.PL marks a version of libxml2 as incompatible or broken
it is done for a good reason.
Full linking information for libxml2 can be obtained
by invoking "xml2-config --libs".
Notes for Microsoft Windows
Thanks to Randy Kobes there is a pre-compiled PPM package available on
http://theoryx5.uwinnipeg.ca/ppmpackages/
Usually it takes a little time to build the package for the latest release.
If you want to build XML::LibXML on Windows from source, you can use
the following instructions contributed by Christopher J. Madsen:
These instructions assume that you already have your system set up to
compile modules that use C components.
First, get the libxml2 binaries from http://xmlsoft.org/sources/win32/
(currently also available at http://www.zlatkovic.com/pub/libxml/).
You need:
iconv-VERSION.win32.zip
libxml2-VERSION.win32.zip
zlib-VERSION.win32.zip
Download the latest version of each. (Each package will probably have
a different version.) When you extract them, you'll get directories
named iconv-VERSION.win32, libxml2-VERSION.win32, and
zlib-VERSION.win32, each containing bin, lib, and include directories.
Combine all the bin, include, and lib directories under c:\Prog\LibXML.
(You can use any directory you prefer; just adjust the instructions
accordingly.)
Get the latest version of XML-LibXML from CPAN.
Extract it.
Issue these commands in the XML-LibXML-VERSION directory:
perl Makefile.PL INC=-Ic:\Prog\LibXML\include LIBS=-Lc:\Prog\LibXML\lib
nmake
copy c:\Prog\LibXML\bin\*.dll blib\arch\auto\XML\LibXML
nmake test
nmake install
(Note: Some systems use dmake instead of nmake.)
By copying the libxml2 DLLs to the arch directory, you help avoid
conflicts with other programs you may have installed that use other
(possibly incompatible) versions of those DLLs.
Notes for Mac OS X
Due to a refactoring of the module, XML::LibXML will
not run with some earlier versions of Mac OS X. It
appears that this is related to special linker options
for that OS prior to version 10.2.2. Since the
developers do not have full access to this OS, help/
patches from OS X gurus are highly appreciated.
It is confirmed that XML::LibXML builds and runs
without problems since Mac OS X 10.2.6.
Notes for HPUX
XML::LibXML requires libxml2 2.6.16 or
later. There may not exist a usable binary
libxml2 package for HPUX and XML::LibXML. If
HPUX cc does not compile libxml2
correctly, you will be forced to recompile perl with
gcc (unless you have already done that).
Additionally I received the following Note from Rozi Kovesdi:
Here is my report if someone else runs into the same problem:
Finally I am done with installing all the libraries and XML Perl
modules
The combination that worked best for me was:
gcc
GNU make
Most importantly - before trying to install Perl modules that depend on
libxml2:
must set SHLIB_PATH to include the path to libxml2 shared library
assuming that you used the default:
export SHLIB_PATH=/usr/local/lib
also, make sure that the config files have execute permission:
/usr/local/bin/xml2-config
/usr/local/bin/xslt-config
they did not have +x after they were installed by 'make install'
and it took me a while to realize that this was my problem
or one can use:
perl Makefile.PL LIBS='-L/path/to/lib' INC='-I/path/to/include'
Contact
For bug reports, please use the issue tracker at
https://github.com/shlomif/perl-XML-LibXML/issues .
For suggestions etc. you may contact the maintainer directly at
https://www.shlomifish.org/me/contact-me/
, but in general, it is recommended to use the mailing
list given below.
For suggestions etc., and other issues
related to XML::LibXML you may use the perl XML mailing list
(perl-xml@listserv.ActiveState.com),
where most XML-related Perl modules are discussed.
In case of problems you should check the archives of that
list first. Many problems are already discussed there. You
can find the list's archives and subscription options at
http://aspn.activestate.com/ASPN/Mail/Browse/Threaded/perl-xml
Package History
Versions < 0.98 were maintained by Matt Sergeant
Versions >= 0.98 and < 1.49 were maintained by Matt Sergeant and Christian Glahn
Versions >= 1.49 are maintained by Christian Glahn
Versions > 1.56 are co-maintained by Petr Pajas
Versions >= 1.59 are provisionally maintained by Petr Pajas
Patches and Developer Version
As XML::LibXML is open source software, help and
patches are appreciated. If you find a bug in the current
release, make sure this bug still exists in the developer
version of XML::LibXML. This version can be cloned
from its Git repository. For more information about that,
see:
https://github.com/shlomif/perl-XML-LibXML
Please consider all regression tests as correct. If
any test fails it is most certainly related to a
bug.
If you find documentation bugs, please fix them in
the libxml.dbk file, stored in the docs directory.
Known Issues
The push-parser implementation causes memory leaks.
License
LICENSE
This is free software, you may use it and distribute it under the same terms as Perl itself.
Copyright 2001-2003 AxKit.com Ltd., 2002-2006 Christian Glahn, 2006-2009 Petr Pajas
Disclaimer
THIS PROGRAM IS DISTRIBUTED IN THE HOPE THAT IT WILL
BE USEFUL, BUT WITHOUT ANY WARRANTY; WITHOUT EVEN THE
IMPLIED WARRANTY OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE.
Perl Binding for libxml2
XML::LibXML
Synopsis
use XML::LibXML;
my $dom = XML::LibXML->load_xml(string => <<'EOT');
<some-xml/>
EOT
Description
This module is an interface to libxml2, providing
XML and HTML parsers with DOM, SAX and XMLReader interfaces,
a large subset of DOM Layer 3 interface and
a XML::XPath-like interface to XPath API of libxml2.
The module is split into several packages which are not described in this section;
unless stated otherwise, you only need to use XML::LibXML;
in your programs.
Check out XML::LibXML by Example
for a tutorial.
For further information, please check the following documentation:
Parsing XML files with XML::LibXML
XML::LibXML Document Object Model (DOM) Implementation
XML::LibXML direct SAX parser
Reading XML with a pull-parser
XML::LibXML frontend for DTD validation
XML::LibXML frontend for RelaxNG schema validation
XML::LibXML frontend for W3C Schema schema validation
API for evaluating XPath expressions with enhanced support
for the evaluation context
Implementing custom URI Resolver and input callbacks
Common functions for XML::LibXML related Classes
The nodes in the Document Object Model (DOM) are represented by the following classes
(most of which "inherit" from XML::LibXML::Node):
XML::LibXML class for DOM document nodes
Abstract base class for XML::LibXML DOM nodes
XML::LibXML class for DOM element nodes
XML::LibXML class for DOM text nodes
XML::LibXML class for comment DOM nodes
XML::LibXML class for DOM CDATA sections
XML::LibXML DOM attribute class
XML::LibXML's DOM L2 Document Fragment implementation
XML::LibXML DOM namespace nodes
XML::LibXML DOM processing instruction nodes
Encodings support in XML::LibXML
Recall that since version 5.6.1, Perl distinguishes between
character strings (internally encoded in UTF-8) and so
called binary data and, accordingly, applies either
character or byte semantics to them. A scalar
representing a character string is distinguished from
a byte string by a special flag (UTF8). Please refer to perlunicode for details.
XML::LibXML's API is designed to deal with many
encodings of XML documents completely transparently, so
that the application using XML::LibXML can be completely
ignorant about the encoding of the XML documents it works with.
On the other hand, functions like XML::LibXML::Document->setEncoding
give the user control over the document encoding.
To ensure the aforementioned transparency and
uniformity, most functions of XML::LibXML that work with
in-memory trees accept and return data as character
strings (i.e. UTF-8 encoded with the UTF8 flag on)
regardless of the original document encoding; however,
the functions related to I/O operations (i.e. parsing
and saving) operate with binary data (in the original
document encoding) obeying the encoding declaration of
the XML documents.
Below we summarize basic rules and principles
regarding encoding:
Do NOT apply any encoding-related PerlIO layers
(:utf8 or :encoding(...))
to file handles that are an input for the parser
or an output for a serializer of (full) XML documents.
This is because the conversion of the data to/from the internal character representation
is provided by libxml2 itself which must be able to enforce the encoding
specified by the <?xml version="1.0" encoding="..."?>
declaration. Here is an example to follow:
use XML::LibXML;
# load
open my $fh, '<', 'file.xml' or die "Cannot open file.xml: $!";
binmode $fh; # drop all PerlIO layers possibly created by a use open pragma
my $doc = XML::LibXML->load_xml(IO => $fh);
# save
open my $out, '>', 'out.xml' or die "Cannot open out.xml: $!";
binmode $out; # as above
$doc->toFH($out);
# or
print {$out} $doc->toString();
All functions working with DOM accept and return
character strings (UTF-8 encoded with UTF8 flag on). E.g.
my $doc = XML::LibXML::Document->new('1.0',$some_encoding);
my $element = $doc->createElement($name);
$element->appendText($text);
$xml_fragment = $element->toString(); # returns a character string
$xml_document = $doc->toString(); # returns a byte string
where
$some_encoding is the document encoding
that will be used when saving the document,
and $name and $text
contain character strings (UTF-8 encoded with UTF8 flag on).
Note that the method toString
returns XML as a character string if applied to
a node other than the Document node and
a byte string containing the appropriate
<?xml version="1.0" encoding="..."?>
declaration if applied to a Document node.
DOM methods also accept binary strings in the original encoding of the
document to which the node belongs (UTF-8 is assumed if the node is not
attached to any document). Exploiting this feature is NOT RECOMMENDED
since it is considered bad practice.
my $doc = XML::LibXML::Document->new('1.0','iso-8859-2');
my $text = $doc->createTextNode($some_latin2_encoded_byte_string);
# WORKS, BUT NOT RECOMMENDED!
NOTE: libxml2 support for many
encodings is based on the iconv library. The actual list
of supported encodings may vary from platform to
platform. To test if your platform works correctly with
your language encoding, build a simple document in the
particular encoding and try to parse it with XML::LibXML
to see if the parser produces any errors. Occasional
crashes were reported on rare platforms that ship with a broken
version of iconv.
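The platform check suggested in the note above can be sketched as a small round trip. This is only an illustration: ISO-8859-2 stands in for "your language encoding", and the element name and sample text are made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample text as a Perl character string; both accented letters
# exist in ISO-8859-2.
my $text = "p\x{159}\x{ed}klad";

# Build a small document that declares the encoding under test.
my $doc  = XML::LibXML::Document->new('1.0', 'iso-8859-2');
my $root = $doc->createElement('sample');
$root->appendText($text);
$doc->setDocumentElement($root);

# Serializing the whole document yields bytes in the declared encoding.
my $bytes = $doc->toString();

# Parse the bytes back; a parser error is caught and reported.
my $reparsed = eval { XML::LibXML->load_xml(string => $bytes) };
die "Encoding round trip failed: $@" unless $reparsed;

# The DOM hands the text back as a character string again.
my $round_trip_ok = $reparsed->documentElement->textContent eq $text ? 1 : 0;
print $round_trip_ok ? "iso-8859-2 round trip OK\n" : "text was altered\n";
```

If the parse step dies or the text comes back altered, the iconv installation on that platform is the first thing to check.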
Thread Support
XML::LibXML since 1.67 partially supports Perl threads
in Perl >= 5.8.8. XML::LibXML can be used with threads
in two ways:
By default, all
XML::LibXML classes use CLONE_SKIP class method
to prevent Perl from copying XML::LibXML::* objects
when a new thread is spawned.
In this mode, all XML::LibXML::* objects are thread specific.
This is the safest way
to work with XML::LibXML in threads.
Alternatively, one may use
use threads;
use XML::LibXML qw(:threads_shared);
to indicate that
all XML::LibXML node and parser objects
should be shared between the main thread
and any thread spawned from there.
For example, in
my $doc = XML::LibXML->load_xml(location => $filename);
my $thr = threads->new(sub{
# code working with $doc
1;
});
$thr->join;
the variable $doc
refers to the exact same XML::LibXML::Document
in the spawned thread as in the main thread.
Without using mutex locks,
parallel threads may read the same document
(i.e. any node that belongs to the document),
parse files, and modify different documents.
However, if there is a chance that
some of the threads will attempt to modify a document
(or even create
new nodes based on that document,
e.g. with $doc->createElement)
that other threads may be reading at the same time,
the user is responsible for creating a mutex lock
and using it both
in the thread that modifies and
in the thread that reads:
my $doc = XML::LibXML->load_xml(location => $filename);
my $mutex : shared;
my $thr = threads->new(sub{
lock $mutex;
my $el = $doc->createElement('foo');
# ...
1;
});
{
lock $mutex;
my $root = $doc->documentElement;
print $root->nodeName, "\n";
}
$thr->join;
Note that libxml2 uses dictionaries to store short strings and
these dictionaries are kept on a document node. Without mutex locks, it
could happen in the previous example that the thread modifies the
dictionary while other threads attempt to read from it, which could
easily lead to a crash.
Version Information
Sometimes it is useful to find out which libxml2
version XML::LibXML was compiled for. In most cases this
is for debugging or to check if a given installation provides
all functionality needed by the package. The functions
XML::LibXML::LIBXML_DOTTED_VERSION and
XML::LibXML::LIBXML_VERSION provide this version
information. Both functions simply pass through the values
of the similarly named macros of libxml2.
Similarly, XML::LibXML::LIBXML_RUNTIME_VERSION returns
the version of the (usually dynamically) linked libxml2.
XML::LibXML::LIBXML_DOTTED_VERSION
$Version_String = XML::LibXML::LIBXML_DOTTED_VERSION;
Returns the version string of the
libxml2 version XML::LibXML was compiled
for. This will be "2.6.2" for "libxml2
2.6.2".
XML::LibXML::LIBXML_VERSION
$Version_ID = XML::LibXML::LIBXML_VERSION;
Returns the version id of the libxml2
version XML::LibXML was compiled for. This
will be "20602" for "libxml2 2.6.2". Do not confuse
this version id with
$XML::LibXML::VERSION. The latter contains the
version of XML::LibXML itself while the former
contains the version of libxml2 XML::LibXML
was compiled for.
XML::LibXML::LIBXML_RUNTIME_VERSION
$DLL_Version = XML::LibXML::LIBXML_RUNTIME_VERSION;
Returns a version string of the libxml2
which is (usually dynamically) linked by
XML::LibXML. This will be "20602" for libxml2
released as "2.6.2" and something like
"20602-CVS2032" for a CVS build of
libxml2.
XML::LibXML issues a warning if the version
of libxml2 dynamically linked to it is less than the version of libxml2
which it was compiled against.
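As a quick illustration, the three functions can be combined into a small diagnostic script. The comparison at the end is only a sketch of the check described above; it assumes the runtime version string starts with the numeric version id.

```perl
use strict;
use warnings;
use XML::LibXML;

my $compiled_dotted = XML::LibXML::LIBXML_DOTTED_VERSION();
my $compiled_id     = XML::LibXML::LIBXML_VERSION();
my $runtime         = XML::LibXML::LIBXML_RUNTIME_VERSION();

print "XML::LibXML itself:     $XML::LibXML::VERSION\n";
print "compiled for libxml2:   $compiled_dotted (id $compiled_id)\n";
print "running with libxml2:   $runtime\n";

# The runtime string may carry a suffix, so compare only its leading digits.
my ($runtime_id) = $runtime =~ /^(\d+)/;
if (defined $runtime_id && $runtime_id < $compiled_id) {
    print "Note: the linked libxml2 is older than the one compiled against\n";
}
```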
EXPORTS
By default the module exports all constants and functions
listed in the :all tag, described below.
EXPORT TAGS
:all
Includes the tags :libxml, :encoding, and
:ns described below.
:libxml
Exports integer constants for DOM node types.
XML_ELEMENT_NODE => 1
XML_ATTRIBUTE_NODE => 2
XML_TEXT_NODE => 3
XML_CDATA_SECTION_NODE => 4
XML_ENTITY_REF_NODE => 5
XML_ENTITY_NODE => 6
XML_PI_NODE => 7
XML_COMMENT_NODE => 8
XML_DOCUMENT_NODE => 9
XML_DOCUMENT_TYPE_NODE => 10
XML_DOCUMENT_FRAG_NODE => 11
XML_NOTATION_NODE => 12
XML_HTML_DOCUMENT_NODE => 13
XML_DTD_NODE => 14
XML_ELEMENT_DECL => 15
XML_ATTRIBUTE_DECL => 16
XML_ENTITY_DECL => 17
XML_NAMESPACE_DECL => 18
XML_XINCLUDE_START => 19
XML_XINCLUDE_END => 20
:encoding
Exports two encoding conversion functions from XML::LibXML::Common.
encodeToUTF8()
decodeFromUTF8()
:ns
Exports two convenience constants: the implicit namespace of the
reserved xml: prefix,
and the implicit namespace for the reserved xmlns: prefix.
XML_XML_NS => 'http://www.w3.org/XML/1998/namespace'
XML_XMLNS_NS => 'http://www.w3.org/2000/xmlns/'
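A short sketch of how the exported constants might be used: the node type constants classify child nodes, and XML_XML_NS looks up an attribute with the reserved xml: prefix. The sample document and the %label table are illustrative only.

```perl
use strict;
use warnings;
use XML::LibXML qw(:libxml :ns);

my $doc = XML::LibXML->load_xml(
    string => '<root xml:lang="en"><!-- note -->text<child/></root>'
);
my $root = $doc->documentElement;

# Map a few node type constants to readable labels.
my %label = (
    XML_ELEMENT_NODE() => 'element',
    XML_TEXT_NODE()    => 'text',
    XML_COMMENT_NODE() => 'comment',
);

my @kinds = map { $label{ $_->nodeType } // 'other' } $root->childNodes;
print join(', ', @kinds), "\n";    # prints: comment, text, element

# The xml: prefix is bound to XML_XML_NS without any declaration.
my $lang = $root->getAttributeNS(XML_XML_NS, 'lang');
print "xml:lang is $lang\n";       # prints: xml:lang is en
```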
Related Modules
The modules described in this section are not part of the XML::LibXML package itself. As they support some additional features, they are
mentioned here.
XML::LibXSLT
XSLT 1.0 Processor using libxslt and XML::LibXML
XML::LibXML::Iterator
XML::LibXML Implementation of the DOM Traversal Specification
XML::CompactTree::XS
Uses XML::LibXML::Reader to very efficiently parse an XML document
or element into native Perl data structures, which are less flexible but
significantly faster to process than DOM.
XML::LibXML and XML::GDOME
Note: THE FUNCTIONS DESCRIBED HERE ARE STILL EXPERIMENTAL
Although both modules make use of libxml2's XML capabilities, the DOM implementations of the two modules are not compatible. It is still
possible to exchange nodes from one DOM to the other. The concept of this exchange is pretty similar to the function cloneNode(): The particular
node is copied at a low level to the opposite DOM implementation.
Since the DOM implementations cannot coexist within one document, one is forced to copy each node that should be used. Because you are always
keeping two nodes this may have quite an impact on a machine's memory usage.
XML::LibXML provides two functions to export or import GDOME nodes: import_GDOME() and export_GDOME(). Both functions have two parameters: the
node and a flag for recursive import. The flag works as in cloneNode().
The two functions allow one to export and import XML::GDOME nodes explicitly; however, XML::LibXML also allows the transparent import of
XML::GDOME nodes in functions such as appendChild(), insertAfter() and so on. While native nodes are automatically adopted in most functions,
XML::GDOME nodes are always cloned in advance. Thus if the original node is modified after the operation, the node in the XML::LibXML document will
not have this information.
import_GDOME
$libxmlnode = XML::LibXML->import_GDOME( $node, $deep );
This clones an XML::GDOME node to an XML::LibXML node explicitly.
export_GDOME
$gdomenode = XML::LibXML->export_GDOME( $node, $deep );
Allows one to clone an XML::LibXML node into an
XML::GDOME node.
CONTACTS
For bug reports, please use the CPAN request tracker on http://rt.cpan.org/NoAuth/Bugs.html?Dist=XML-LibXML
For suggestions etc., and other issues
related to XML::LibXML you may use the perl XML mailing list
(perl-xml@listserv.ActiveState.com),
where most XML-related Perl modules are discussed.
In case of problems you should check the archives of that
list first. Many problems are already discussed there. You
can find the list's archives and subscription options at
http://aspn.activestate.com/ASPN/Mail/Browse/Threaded/perl-xml.
Parsing XML Data with XML::LibXML
XML::LibXML::Parser
Synopsis
use XML::LibXML '1.70';
Parsing
An XML document is read into a data structure such as a DOM tree by a piece of software, called a parser. XML::LibXML currently provides four
different parser interfaces:
A DOM Pull-Parser
A DOM Push-Parser
A SAX Parser
A DOM based SAX Parser.
Creating a Parser Instance
XML::LibXML provides an OO interface to the libxml2 parser functions. Thus you have to create a parser instance before you can parse any
XML data.
new
# Parser constructor
$parser = XML::LibXML->new();
$parser = XML::LibXML->new(option=>value, ...);
$parser = XML::LibXML->new({option=>value, ...});
Create a new XML and HTML parser instance.
Each parser instance holds default
values for various parser options.
Optionally,
one can pass a hash reference or
a list of option => value pairs to
set a different default set of options.
Unless specified otherwise, the options
load_ext_dtd, and
expand_entities are set to 1.
See Parser Options for a list of the libxml2 parser's options.
DOM Parser
One of the common parser interfaces of XML::LibXML is the DOM parser. This parser reads XML data into a DOM like data structure, so each
tag can get accessed and transformed.
XML::LibXML's DOM parser is capable of parsing not only XML data, but also (strict) HTML files. There are three ways to parse
documents - as a string, as a Perl filehandle, or as a filename/URL. The return value from each is an XML::LibXML::Document object, which is a DOM
object.
All of the functions listed below will throw an exception if the document is invalid. To prevent this from terminating your program, wrap
the call in an eval{} block.
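A minimal sketch of the eval{} pattern, using load_xml (described below) on a deliberately broken string. The exception is stringified before printing because it may be an object rather than a plain string.

```perl
use strict;
use warnings;
use XML::LibXML;

my $broken_xml = '<a><b></a>';    # <b> is never closed

# eval returns undef if the parser throws; the exception is left in $@.
my $dom = eval { XML::LibXML->load_xml(string => $broken_xml) };

if (!defined $dom) {
    my $error = "$@";             # stringify the exception
    print "Parse failed: $error\n";
}
else {
    print "Parsed OK\n";
}
```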
load_xml
# Parsing XML
$dom = XML::LibXML->load_xml(
location => $file_or_url
# parser options ...
);
$dom = XML::LibXML->load_xml(
string => $xml_string
# parser options ...
);
$dom = XML::LibXML->load_xml(
string => (\$xml_string)
# parser options ...
);
$dom = XML::LibXML->load_xml({
IO => $perl_file_handle
# parser options ...
});
$dom = $parser->load_xml(...);
This function is available since XML::LibXML 1.70. It provides an easy-to-use interface to the XML parser that parses
a given file (or non-HTTPS URL), string, or input stream
to a DOM tree. The arguments
can be passed in a HASH reference
or as name => value pairs.
The function can be called
as a class method or an object method.
In both cases it internally creates a new
parser instance passing
the specified parser options;
if called as an object method,
it clones the original parser (preserving
its settings) and additionally applies
the specified options to the new parser.
See the constructor new
and Parser Options
for more information.
Note that, due to a limitation in the underlying libxml2
library, this call does not recognize HTTPS-based URLs. (It
will treat an HTTPS URL as a filename, likely throwing a "No such
file or directory" exception.)
load_html
# Parsing HTML
$dom = XML::LibXML->load_html(...);
$dom = $parser->load_html(...);
This function is available since XML::LibXML 1.70. It has the same usage as load_xml,
providing an interface to the HTML parser.
See load_xml for more information.
Parsing HTML may cause problems, especially if
the ampersand ('&') is used. This is a common
problem if HTML code is parsed that contains links to
CGI-scripts. Such links cause the parser to throw
errors. In such cases libxml2 still parses the entire
document as if there were no error, but the error causes
XML::LibXML to stop the parsing process. However, the
document is not lost. Such HTML documents should be
parsed using the recover flag. By
default recovering is deactivated.
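The following sketch parses an HTML snippet containing a CGI-style link with a bare ampersand, using the recover flag. The HTML string is made up for the example, and the warning handler is only there to collect any messages the parser chooses to emit in recovery mode.

```perl
use strict;
use warnings;
use XML::LibXML;

# A link with an unescaped '&', as commonly found in real-world HTML.
my $html = '<html><body>'
         . '<a href="/cgi-bin/search?a=1&b=2">Search</a>'
         . '</body></html>';

# Collect parser messages instead of letting them reach STDERR.
my @messages;
my $dom = do {
    local $SIG{__WARN__} = sub { push @messages, @_ };
    XML::LibXML->load_html(string => $html, recover => 1);
};

my @links = $dom->findnodes('//a');
printf "Recovered %d link(s), %d parser message(s)\n",
    scalar @links, scalar @messages;
```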
The functions described above are implemented to
parse well formed documents. In some cases a program
gets well balanced XML instead of well formed
documents (e.g. an XML fragment from a database). With
XML::LibXML it is not necessary to wrap such fragments
in a root element in your code, because XML::LibXML is also capable of
parsing well balanced XML fragments.
parse_balanced_chunk
# Parsing well-balanced XML chunks
$fragment = $parser->parse_balanced_chunk( $wbxmlstring, $encoding );
This function parses a well balanced XML string into an XML::LibXML::DocumentFragment. The first argument contains the input string; the optional second argument can be used to specify the character encoding of the input (UTF-8 is assumed by default).
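A minimal sketch of parse_balanced_chunk: a string with two sibling elements and no single root is parsed into a fragment, which is then appended to an element of a separate document. The element names are illustrative only.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;

# Two sibling elements: well balanced, but not a well formed document.
my $fragment = $parser->parse_balanced_chunk(
    '<item>one</item><item>two</item>'
);

# Build a target document and move the fragment's children into it.
my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $doc->createElement('list');
$doc->setDocumentElement($root);
$root->appendChild($fragment);

my @items = $root->findnodes('item');
print scalar(@items), " item(s) appended\n";
print $doc->toString;
```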
parse_xml_chunk
This is the old name of parse_balanced_chunk(). Because it may cause confusion with the push parser interface, this function
should not be used anymore.
By default XML::LibXML does not process XInclude tags
within an XML Document (see options section below).
XML::LibXML allows one to post-process a document to expand
XInclude tags.
process_xincludes
# Processing XInclude
$parser->process_xincludes( $doc );
After a document is parsed into a DOM structure, you may want to expand the document's XInclude tags. This function processes
the given document structure and expands all XInclude tags (or throws an error) by using the flags and callbacks of the given parser
instance.
Note that the resulting Tree contains some extra nodes (of type XML_XINCLUDE_START and XML_XINCLUDE_END) after successfully
processing the document. These nodes indicate where data was included into the original tree. If the document is serialized, these
extra nodes will not show up.
Remember: A Document with processed XIncludes differs from the original document after serialization, because the original
XInclude tags will not get restored!
If the parser flag "expand_xincludes" is set to 1, you do not need to post-process the parsed document.
processXIncludes
$parser->processXIncludes( $doc );
This is an alias to process_xincludes, but through a Java-like function name.
parse_file
# Old-style parser interfaces
$doc = $parser->parse_file( $xmlfilename );
This function parses an XML document from a file or network;
$xmlfilename can be either a filename or a (non-HTTPS) URL.
Note that for parsing files, this function is the fastest choice,
about 6-8 times faster than parse_fh().
parse_fh
$doc = $parser->parse_fh( $io_fh );
parse_fh() parses an IOREF or a subclass of IO::Handle.
Because the data comes from an open handle, libxml2's parser does not know about the base URI of the document. To set the
base URI one should use parse_fh() as follows:
my $doc = $parser->parse_fh( $io_fh, $baseuri );
parse_string
$doc = $parser->parse_string( $xmlstring);
This function is similar to parse_fh(), but it parses an XML document that is available as a single string in memory, or alternatively as a reference to a scalar containing a string. Again,
you can pass an optional base URI to the function.
my $doc = $parser->parse_string( $xmlstring, $baseuri );
my $doc = $parser->parse_string(\$xmlstring, $baseuri);
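As a small illustration of the optional base URI, the sketch below parses a string and then reads the URI back from the resulting document. The URL and the document content are hypothetical; the example assumes the base URI passed to parse_string() is what the document's URI accessor reports.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser   = XML::LibXML->new;
my $base_uri = 'http://example.com/docs/index.xml';    # hypothetical location

# An in-memory string has no location of its own, so one is supplied.
my $doc = $parser->parse_string(
    '<doc><link href="images/logo.png"/></doc>',
    $base_uri,
);

print "Document URI: ", $doc->URI, "\n";
```

Relative references inside the document, such as external entities or XInclude targets, are resolved against this base URI.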
parse_html_file
$doc = $parser->parse_html_file( $htmlfile, \%opts );
Similar to parse_file() but parses HTML (strict) documents;
$htmlfile can be a filename or a (non-HTTPS) URL.
An optional second argument can be
used to pass some options to the HTML
parser as a HASH reference.
See options labeled with HTML in Parser Options.
parse_html_fh
$doc = $parser->parse_html_fh( $io_fh, \%opts );
Similar to parse_fh() but parses HTML (strict) streams.
An optional second argument can be used
to pass some options to the HTML parser
as a HASH reference.
See options labeled with HTML in Parser Options.
Note: the encoding option may
not work correctly with this function
in libxml2 < 2.6.27 if the HTML file
declares charset using a META tag.
parse_html_string
$doc = $parser->parse_html_string( $htmlstring, \%opts );
Similar to parse_string() but parses HTML (strict) strings.
An optional second argument can be used to pass some options to the
HTML parser as a HASH reference.
See options labeled with HTML in Parser Options.
Push Parser
XML::LibXML provides a push parser interface. Rather than pulling the data from a given source the push parser waits for the data to be
pushed into it.
This allows one to parse large documents without waiting for the parser to finish. The interface is especially useful if a program needs
to pre-process the incoming pieces of XML (e.g. to detect document boundaries).
While the XML::LibXML parse_*() functions require the data to be well-formed XML, the push parser will take any arbitrary string that contains
some XML data. The only requirement is that all the pushed strings together form a well formed document. With the push parser interface a
program can interrupt the parsing process as required, where the parse_*() functions do not give enough flexibility.
Unlike the pull parser implemented in parse_fh() or parse_file(), the push parser is not able to find out about the document's end
itself. Thus the calling program needs to indicate explicitly when the parsing is done.
In XML::LibXML this is done by a single function:
parse_chunk
# Push parser
$parser->parse_chunk($string, $terminate);
parse_chunk() tries to parse a given chunk of data, which isn't necessarily well balanced data. The function takes two
parameters: the chunk of data as a string and, optionally, a termination flag. If the termination flag is set to a true value (e.g. 1),
the parsing will be stopped and the resulting document will be returned as the following example describes:
my $parser = XML::LibXML->new;
for my $string ( "<", "foo", ' bar="hello world"', "/>") {
$parser->parse_chunk( $string );
}
my $doc = $parser->parse_chunk("", 1); # terminate the parsing
Internally XML::LibXML provides three functions that control the push parser process:
init_push
$parser->init_push();
Initializes the push parser.
push
$parser->push(@data);
This function pushes the data stored inside the array to libxml2's parser. Each entry in @data must be a normal scalar! This method can be called repeatedly.
finish_push
$doc = $parser->finish_push( $recover );
This function returns the result of the parsing process. If this function is called without a parameter it will complain about
non well-formed documents. If $recover is 1, the push parser can be used to restore broken or non well formed (XML) documents as the
following example shows:
eval {
$parser->push( "<foo>", "bar" );
$doc = $parser->finish_push(); # will report broken XML
};
if ( $@ ) {
# ...
}
This can be annoying if the closing tag is missed by accident. The following code will restore the document:
eval {
$parser->push( "<foo>", "bar" );
$doc = $parser->finish_push(1); # will return the data parsed
# unless an error happened
};
print $doc->toString(); # returns "<foo>bar</foo>"
Of course finish_push() will return nothing if there was no data pushed to the parser before.
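As a minimal sketch of the push interface in practice, the following reads its input in fixed-size blocks and hands each block to parse_chunk(). The sample data, the 8-byte block size and the in-memory filehandle (standing in for a socket or pipe) are assumptions made for illustration only.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input; an in-memory filehandle stands in for a socket or pipe.
my $xml = '<log><entry id="1"/><entry id="2"/></log>';
open my $in, '<', \$xml or die "Cannot open in-memory handle: $!";

my $push_parser = XML::LibXML->new;
my $block;

# Feed the parser 8 bytes at a time; block boundaries may fall inside a tag.
while ( read( $in, $block, 8 ) ) {
    $push_parser->parse_chunk($block);
}
close $in;

# An empty chunk with a true termination flag ends parsing
# and returns the document.
my $chunked_doc = $push_parser->parse_chunk( "", 1 );
print $chunked_doc->documentElement->nodeName, "\n";
```

Because the calling program decides when to pass the termination flag, the same loop works for input whose total length is not known in advance.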
Pull Parser (Reader)
XML::LibXML also provides a pull-parser interface similar to the XmlReader interface in .NET.
This interface is almost streaming, and is usually faster and simpler to use than SAX.
See XML::LibXML::Reader.
Direct SAX Parser
XML::LibXML provides a direct SAX parser in the XML::LibXML::SAX module.
DOM based SAX Parser
XML::LibXML also provides a DOM based SAX parser. The SAX parser is defined in
the module XML::LibXML::SAX::Parser. As it is not a stream based parser, it
parses documents into a DOM and traverses the DOM tree instead.
The API of this parser is exactly the same as any other Perl SAX2 parser. See XML::SAX::Intro for details.
Aside from the regular parsing methods, you can access the DOM tree traverser directly, using the generate() method:
my $doc = build_yourself_a_document();
my $saxparser = XML::LibXML::SAX::Parser->new( ... );
$saxparser->generate( $doc );
This is useful for serializing DOM trees, for example trees that you have already processed, or that you have obtained as the result of XSLT
processing.
WARNING
This is NOT a streaming SAX parser. As I said above, this parser reads the entire document into a DOM and serialises it. Some people
couldn't read that in the paragraph above so I've added this warning. If you want a streaming SAX parser look at the XML::LibXML::SAX man page.
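The generate() traversal described above can be sketched with a small handler that counts element names. The ElementCounter package is hypothetical and exists only for this example; any SAX2 handler object could take its place.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::SAX::Parser;

{
    # Hypothetical SAX handler: counts start_element events per element name.
    package ElementCounter;
    sub new { return bless { count => {} }, shift }

    sub start_element {
        my ( $self, $element ) = @_;
        $self->{count}{ $element->{Name} }++;
    }
}

# The document already exists as a DOM tree ...
my $dom = XML::LibXML->load_xml( string => '<a><b/><b/><c/></a>' );

# ... and generate() replays it as SAX events to the handler.
my $counter   = ElementCounter->new;
my $saxparser = XML::LibXML::SAX::Parser->new( Handler => $counter );
$saxparser->generate($dom);

for my $name ( sort keys %{ $counter->{count} } ) {
    print "$name: $counter->{count}{$name}\n";
}
```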
Serialization
XML::LibXML provides some functions to serialize nodes and documents. The serialization functions are described on the
XML::LibXML::Node manpage or the XML::LibXML::Document manpage. XML::LibXML checks three global flags that alter the serialization process:
skipXMLDeclaration
skipDTD
setTagCompression
Of these three flags, only setTagCompression is available for all serialization functions.
Because XML::LibXML does not set these flags itself, one has to define them locally as the following example shows:
local $XML::LibXML::skipXMLDeclaration = 1;
local $XML::LibXML::skipDTD = 1;
local $XML::LibXML::setTagCompression = 1;
If skipXMLDeclaration is defined and not '0', the XML declaration is omitted during serialization.
If skipDTD is defined and not '0', an existing DTD would not be serialized with the document.
If setTagCompression is defined and not '0' empty tags are displayed as open and closing tags rather than the shortcut. For example
the empty tag foo will be rendered as <foo></foo> rather than <foo/>.
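A short sketch of the flags in use: wrapping the local declarations in a block limits their effect to a single serialization call. The sample document is an assumption for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

my $ser_doc = XML::LibXML->load_xml( string => '<root><empty/></root>' );

my $with_flags;
{
    # The flags are only in effect until the end of this block.
    local $XML::LibXML::skipXMLDeclaration = 1;
    local $XML::LibXML::setTagCompression  = 1;
    $with_flags = $ser_doc->toString;
}

# Outside the block the default serialization applies again.
my $without_flags = $ser_doc->toString;

print "with flags:    $with_flags\n";
print "without flags: $without_flags";
```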
Parser Options
Handling of libxml2 parser options has been unified and improved in XML::LibXML 1.70.
You can now set default options for a particular parser instance by
passing them to the constructor as XML::LibXML->new({name=>value, ...})
or XML::LibXML->new(name=>value,...).
The options can be queried and changed using the following methods (pre-1.70 interfaces such as $parser->load_ext_dtd(0) also exist, see below):
option_exists
# Set/query parser options
$parser->option_exists($name);
Returns 1 if the current XML::LibXML version supports
the option $name, otherwise returns 0 (note that this does not necessarily mean that the option is supported
by the underlying libxml2 library).
get_option
$parser->get_option($name);
Returns the current value of the parser option $name.
set_option
$parser->set_option($name,$value);
Sets option $name to value $value.
set_options
$parser->set_options({$name=>$value,...});
Sets multiple parsing options at once.
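The four methods above might be combined as in the following sketch. The choice of options is arbitrary, and 'no_such_option' is a deliberately invalid name used to show the negative case.

```perl
use strict;
use warnings;
use XML::LibXML;

# Defaults for this parser instance are passed to the constructor.
my $opt_parser = XML::LibXML->new( line_numbers => 1, no_network => 1 );

# Change one option, then several at once.
$opt_parser->set_option( recover => 1 );
$opt_parser->set_options( { no_blanks => 1, no_cdata => 1 } );

for my $name (qw(line_numbers no_network recover no_blanks validation)) {
    printf "%-13s %s\n", $name,
        $opt_parser->get_option($name) ? 'on' : 'off';
}

# option_exists() reports whether this XML::LibXML version knows the name.
for my $name (qw(recover no_such_option)) {
    printf "%-15s %s\n", $name,
        $opt_parser->option_exists($name) ? 'known' : 'unknown';
}
```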
IMPORTANT NOTE: This documentation reflects the parser flags available in libxml2 2.7.3.
Some options have no effect if an older version of libxml2 is used.
Each of the flags listed below is labeled
/parser/
if it can be used with a XML::LibXML
parser object (i.e. passed to XML::LibXML->new, XML::LibXML->set_option, etc.)
/html/
if it can be passed to the parse_html_* methods
/reader/
if it can be used with the XML::LibXML::Reader.
Unless specified otherwise, the default for boolean valued options is 0 (false).
The available options are:
URI
/parser, html, reader/
In case of parsing strings or file handles, XML::LibXML doesn't know about the base URI of the document. To make relative
references such as XIncludes work, one has to set a base URI, which is then used for the parsed document.
line_numbers
/parser, html, reader/
If this option is activated, libxml2 will store the line number of each element node in the parsed document.
The line number can be obtained using the line_number() method
of the XML::LibXML::Node class (for non-element nodes
this may report the line number of the containing element).
The line numbers are also used for reporting positions of validation errors.
IMPORTANT:
Due to limitations in the libxml2 library line numbers greater than
65535 will be returned as 65535. Unfortunately, this is a long and sad story, please see
http://bugzilla.gnome.org/show_bug.cgi?id=325533 for more details.
encoding
/html/
character encoding of the input
recover
/parser, html, reader/
recover from errors; possible values are 0, 1, and 2
A true value turns on recovery mode which
allows one to parse broken XML or HTML data.
The recovery mode allows the parser to return
the successfully parsed portion of the input document.
This is useful for almost well-formed documents, where for example
a closing tag is missing somewhere. Still,
XML::LibXML will only parse until the first fatal (non-recoverable) error occurs,
reporting recoverable parsing errors as warnings. To suppress
even these warnings, use recover=>2.
Note that validation is switched off automatically in recovery mode.
expand_entities
/parser, reader/
substitute entities; possible values are 0 and 1; default is 1
Note that although unsetting this flag disables entity substitution, it
does not prevent the parser from loading external entities;
when substitution of an external entity is disabled, the
entity will be represented in the document tree by an XML_ENTITY_REF_NODE node
whose subtree will be the content obtained by parsing the external resource.
Although this nesting is visible from the DOM
it is transparent to XPath data model, so it is possible to
match nodes in an unexpanded entity by the same XPath expression
as if the entity were expanded. See also ext_ent_handler.
ext_ent_handler
/parser/
Provide a custom external entity handler
to be used when expand_entities is set to 1.
Possible value is a subroutine reference.
This feature does not work properly in libxml2 < 2.6.27!
The subroutine provided is called whenever
the parser needs to retrieve the content of an external entity.
It is called with two arguments: the system ID (URI) and the public ID.
The value returned by the subroutine is parsed as the content of the entity.
This method can be used to completely disable entity loading,
e.g. to prevent exploits
where a service is tricked into exposing its private data
by letting it parse a remote file (RSS feed) that contains an entity reference to a local
file (e.g. /etc/fstab).
A more granular solution to this problem, however, is
provided by custom URL resolvers, as in
my $c = XML::LibXML::InputCallback->new();
sub match { # accept file:/ URIs except for XML catalogs in /etc/xml/
my ($uri) = @_;
return ($uri=~m{^file:/}
and $uri !~ m{^file:///etc/xml/})
? 1 : 0;
}
$c->register_callbacks([ \&match, sub{}, sub{}, sub{} ]);
$parser->input_callbacks($c);
load_ext_dtd
/parser, reader/
load the external DTD subset while parsing; possible values are 0 and 1. Unless specified,
XML::LibXML sets this option to 1.
This flag is also required for DTD validation, for providing complete attributes,
and for expanding entities, regardless of whether the document has an internal subset.
Thus switching off external DTD loading will disable entity expansion,
validation, and complete attributes on internal subsets as well.
complete_attributes
/parser, reader/
create default DTD attributes; possible values are 0 and 1
validation
/parser, reader/
validate with the DTD; possible values are 0 and 1
suppress_errors
/parser, html, reader/
suppress error reports; possible values are 0 and 1
suppress_warnings
/parser, html, reader/
suppress warning reports; possible values are 0 and 1
pedantic_parser
/parser, html, reader/
pedantic error reporting; possible values are 0 and 1
no_blanks
/parser, html, reader/
remove blank nodes; possible values are 0 and 1
no_defdtd
/html/
do not add a default DOCTYPE; possible values are 0 and 1.
The default (0) is to add a DTD when the input HTML lacks one.
expand_xinclude or xinclude
/parser, reader/
Implement XInclude substitution; possible values are 0 and 1
Expands XInclude tags immediately while parsing the document.
Note that the parser will use the URI resolvers installed
via XML::LibXML::InputCallback to parse the included document (if any).
no_xinclude_nodes
/parser, reader/
do not generate XINCLUDE START/END nodes; possible values are 0 and 1
no_network
/parser, html, reader/
Forbid network access; possible values are 0 and 1
If set to true, all
attempts to fetch non-local resources (such as
DTD or external entities) will fail (unless
custom callbacks are defined).
It may be
necessary to use the flag recover for
processing documents requiring such resources
while networking is off.
clean_namespaces
/parser, reader/
remove redundant namespaces declarations during parsing; possible values are 0 and 1.
no_cdata
/parser, html, reader/
merge CDATA as text nodes; possible values are 0 and 1
no_basefix
/parser, reader/
do not fix up XInclude xml:base URIs; possible values are 0 and 1
huge
/parser, html, reader/
relax any hardcoded limit from the parser; possible values are 0 and 1. Unless specified,
XML::LibXML sets this option to 0.
Note: the default value for this option was changed to protect against denial
of service through entity expansion attacks. Before enabling the option ensure
you have taken alternative measures to protect your application against this type
of attack.
gdome
/parser/
THIS OPTION IS EXPERIMENTAL!
Although quite powerful, XML::LibXML's DOM implementation is incomplete with respect to
the DOM level 2 or level 3 specifications.
XML::GDOME is based on libxml2 as well, and provides a rather complete DOM implementation by wrapping libgdome.
This flag allows you to make
use of XML::LibXML's full parser options and XML::GDOME's DOM implementation at the same time.
To make use of this function, one has to install libgdome and configure XML::LibXML to use this library.
For this you need to rebuild XML::LibXML!
Note: this feature was not seriously tested in recent XML::LibXML releases.
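Two of the options listed above, line_numbers and recover, can be illustrated with a brief sketch. The input strings are made up for this example, and the broken input is the same unclosed element used in the push parser section.

```perl
use strict;
use warnings;
use XML::LibXML;

# line_numbers: each element remembers the line it started on.
my $ln_parser = XML::LibXML->new( line_numbers => 1 );
my $ln_doc    = $ln_parser->parse_string("<root>\n  <a/>\n  <b/>\n</root>\n");

for my $element ( $ln_doc->findnodes('/root/*') ) {
    printf "%s starts on line %d\n", $element->nodeName, $element->line_number;
}

# recover => 2: return what could be parsed and suppress the warnings.
my $lenient   = XML::LibXML->new( recover => 2 );
my $recovered = $lenient->parse_string('<foo>bar');
print "recovered root element: ", $recovered->documentElement->nodeName, "\n";
```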
For compatibility with XML::LibXML versions prior to 1.70,
the following methods are also supported for querying and setting the corresponding parser options
(if called without arguments, the methods return
the current value of the corresponding parser options; with an argument sets the option to a given value):
$parser->validation();
$parser->recover();
$parser->pedantic_parser();
$parser->line_numbers();
$parser->load_ext_dtd();
$parser->complete_attributes();
$parser->expand_xinclude();
$parser->gdome_dom();
$parser->clean_namespaces();
$parser->no_network();
The following obsolete methods trigger parser options in some
special way:
recover_silently
$parser->recover_silently(1);
If called without an argument,
returns true if the current value of the recover parser
option is 2 and returns false otherwise.
With a true argument sets the recover parser option to 2;
with a false argument sets the recover parser option to 0.
expand_entities
$parser->expand_entities(0);
Get/set the expand_entities option.
If called with a true argument, also turns
the load_ext_dtd option to 1.
keep_blanks
$parser->keep_blanks(0);
This is actually the opposite of the no_blanks parser option.
If used without an argument retrieves negated value of no_blanks.
If used with an argument sets no_blanks to the opposite value.
base_uri
$parser->base_uri( $your_base_uri );
Get/set the URI option.
XML Catalogs
libxml2 supports XML catalogs.
Catalogs are used to
map remote resources to their local copies.
Using catalogs can speed up parsing processes if
many external resources from remote addresses
are loaded into the parsed documents (such as DTDs or XIncludes).
Note that libxml2 has a global pool of loaded catalogs,
so if you apply the method load_catalog
to one parser instance, all parser instances will start using the catalog
(in addition to other previously loaded catalogs).
Note also that catalogs are not used
when a custom external entity handler is specified. At the
current state it is not possible to make use of both
types of resolving systems at the same time.
load_catalog
# XML catalogs
$parser->load_catalog( $catalog_file );
Loads the XML catalog file $catalog_file.
# Global external entity loader (similar to ext_ent_handler option
# but this works really globally, also in XML::LibXSLT include etc..)
XML::LibXML::externalEntityLoader(\&my_loader);
Error Reporting
XML::LibXML throws exceptions during parsing, validation or XPath processing (and some other occasions). These errors can be caught by using
eval blocks. The error is stored in $@.
There are two implementations: the old one throws $@ which is just a message string,
in the new one $@ is an object from the class XML::LibXML::Error;
this class overrides the operator "" so that when printed,
the object flattens to the usual error message.
XML::LibXML throws errors as they occur; overlooking this is a very common misunderstanding in the use of XML::LibXML. If the eval is omitted, XML::LibXML will always halt your script by
"croaking" (see Carp man page for details).
Also note that an increasing number of functions throw errors if bad data is passed as arguments. If you cannot ensure that valid data is passed to XML::LibXML, you should wrap
these function calls in eval.
Note: since version 1.59, get_last_error() is no longer available in XML::LibXML for thread-safety reasons.
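A minimal sketch of catching a parse error follows. It assumes the newer implementation, in which $@ is an XML::LibXML::Error object, and falls back to treating $@ as a plain string otherwise. The malformed input is an assumption for illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);
use XML::LibXML;

# The closing tag does not match, so parsing throws an exception.
my $broken_doc  = eval { XML::LibXML->load_xml( string => '<a><b></a>' ) };
my $parse_error = $@;

if ( blessed($parse_error) and $parse_error->isa('XML::LibXML::Error') ) {
    # Structured access to the error details.
    printf "%s error at line %d: %s",
        $parse_error->domain, $parse_error->line, $parse_error->message;
}
elsif ($parse_error) {
    # Old-style error: a plain message string.
    print "error: $parse_error";
}
```

Copying $@ into a lexical variable right after the eval avoids losing the error if later code runs another eval.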
XML::LibXML direct SAX parser
XML::LibXML::SAX
Description
XML::LibXML provides an interface to libxml2's direct SAX interface. Through this interface it is possible to generate SAX events directly while
parsing a document. While using the SAX parser, XML::LibXML will not create a DOM Document tree.
Such an interface is useful if very large XML documents have to be processed and no DOM functions are required. By using this interface it is
possible to read data stored within an XML document directly into the application data structures without loading the document into memory.
The SAX interface of XML::LibXML is based on the famous XML::SAX interface. It uses the generic interface as provided by XML::SAX::Base.
In addition to the generic functions, which are only able to process entire documents, XML::LibXML::SAX provides parse_chunk().
This method generates SAX events from well balanced data such as is often provided by databases.
Features
NOTE: This feature is experimental.
You can enable character data joining which may yield a
significant speed boost in your XML processing in lower markup
ratio situations by enabling the
http://xmlns.perl.org/sax/join-character-data feature of this
parser. This is done via the set_feature method like
this:
$p->set_feature('http://xmlns.perl.org/sax/join-character-data', 1);
You can also specify a 0 to disable. The default is to have
this feature disabled.
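As a sketch of the direct SAX parser with this feature enabled, the following collects all character data of a document. The TextCollector package is a hypothetical handler written for this example.

```perl
use strict;
use warnings;
use XML::LibXML::SAX;

{
    # Hypothetical SAX handler: concatenates all character data.
    package TextCollector;
    sub new { return bless { text => '', events => 0 }, shift }

    sub characters {
        my ( $self, $chars ) = @_;
        $self->{text} .= $chars->{Data};
        $self->{events}++;
    }
}

my $collector = TextCollector->new;
my $sax       = XML::LibXML::SAX->new( Handler => $collector );

# Ask the parser to join adjacent character data before dispatching it.
$sax->set_feature( 'http://xmlns.perl.org/sax/join-character-data', 1 );

$sax->parse_string('<msg>Hello, <b>SAX</b> world</msg>');
print "collected text: $collector->{text}\n";
print "characters events: $collector->{events}\n";
```

No DOM tree is built here; the handler sees only the stream of events.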
Building DOM trees from SAX events.
XML::LibXML::SAX::Builder
Synopsis
use XML::LibXML::SAX::Builder;
my $builder = XML::LibXML::SAX::Builder->new();
my $gen = XML::Generator::DBI->new(Handler => $builder, dbh => $dbh);
$gen->execute("SELECT * FROM Users");
my $doc = $builder->result();
Description
This is a SAX handler that generates a DOM tree from SAX events. Usage is as above. Input is accepted from any SAX1 or SAX2 event generator.
Building DOM trees from SAX events is quite easy with XML::LibXML::SAX::Builder. The class is designed as a SAX2 final handler not as a
filter!
Since SAX is strictly stream oriented, you should not expect a generator to return anything. Instead you have to ask the builder instance
directly for the document it built. XML::LibXML::SAX::Builder's result() function returns the document generated from the last SAX stream.
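When no database generator is at hand, the same pattern can be tried with the direct SAX parser as the event source. This is a sketch under that assumption; the input string is made up.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::SAX;
use XML::LibXML::SAX::Builder;

# The builder is the final handler in the SAX chain.
my $tree_builder = XML::LibXML::SAX::Builder->new();

# Any SAX event generator will do; here the direct SAX parser is used.
my $event_source = XML::LibXML::SAX->new( Handler => $tree_builder );
$event_source->parse_string('<users><user name="alice"/></users>');

# Ask the builder, not the generator, for the resulting DOM.
my $built_doc = $tree_builder->result();
print $built_doc->documentElement->nodeName, "\n";
print $built_doc->findvalue('/users/user/@name'), "\n";
```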
XML::LibXML DOM Implementation
XML::LibXML::DOM
Description
XML::LibXML provides a lightweight interface to
modify a node of the document tree
generated by the XML::LibXML parser. This interface
follows as far as possible the DOM Level 3
specification. In addition to the specified functions,
XML::LibXML supports some functions that are more handy to
use in the perl environment.
One also has to remember, that XML::LibXML is an
interface to libxml2 nodes which actually reside on the
C-Level of XML::LibXML. This means each node is a
reference to a structure which is different from a perl hash or
array. The only way to access these structures' values is
through the DOM interface provided by XML::LibXML. This
also means, that one can't simply
inherit an XML::LibXML node and add new member variables as
if they were hash keys.
The DOM interface of XML::LibXML does not intend to
implement a full DOM interface as it is done by XML::GDOME
and used for full featured applications. Rather, it
offers a simple way to build or modify documents that are
created by XML::LibXML's parser.
Another target of the XML::LibXML interface is to
make the interfaces of libxml2 available to the perl
community. This includes also some workarounds to some
features where libxml2 assumes more control over the
C-Level that most perl users don't have.
One of the most important parts of the XML::LibXML
DOM interface is that the interfaces try to follow the
DOM Level 3 specification rather strictly. This means the
interface functions are named as the DOM specification
says and not what widespread Java interfaces claim to be
the standard. For several functions that have
only a single interface conforming to the DOM spec,
XML::LibXML provides an additional Java style alias
interface.
Moreover, there are some function interfaces left over
from early stages of XML::LibXML. These interfaces are
kept for compatibility reasons
only. They might disappear in one of
the future versions of XML::LibXML, so users are requested
to switch over to the official functions.
Encodings and XML::LibXML's DOM implementation
See the section on Encodings in the XML::LibXML manual page.
Namespaces and XML::LibXML's DOM implementation
XML::LibXML's DOM implementation is
limited by the DOM implementation of libxml2
which treats namespaces slightly differently than
required by the DOM Level 2 specification.
According to the DOM Level 2 specification,
namespaces of elements and attributes should be
persistent, and nodes should be permanently bound to
namespace URIs as they get created; it should be
possible to manipulate the special attributes used for
declaring XML namespaces just as other attributes
without affecting the namespaces of other nodes.
In DOM Level 2, the application is responsible
for creating the special attributes consistently and/or for correct
serialization of the document.
This is inconvenient, causes problems in serialization
of DOM to XML, and most importantly, seems almost impossible
to implement over libxml2.
In libxml2, the namespace URI and prefix of a node are
provided by a pointer to a namespace declaration
(appearing as a special xmlns attribute in the XML
document). If the prefix or namespace URI of the
declaration changes, the prefix and namespace URI of all
nodes that point to it change as well. Moreover, in
contrast to DOM, a node (element or attribute) can only
be bound to a namespace URI if there is some namespace
declaration in the document to point to.
Therefore the current DOM implementation in XML::LibXML tries
to treat namespace declarations in a compromise between
reason, common sense, limitations of libxml2, and the DOM
Level 2 specification.
In XML::LibXML, special attributes declaring XML namespaces
are often created automatically, usually when
a namespaced node is attached to a document
and no existing declaration of the namespace and prefix is in the
scope to be reused.
In this respect,
XML::LibXML DOM implementation differs from the DOM
Level 2 specification according to which special
attributes for declaring the appropriate XML namespaces
should not be added when a node with a namespace prefix
and namespace URI is created.
Namespace declarations are also created when
XML::LibXML::Document's
createElementNS() or createAttributeNS() functions are used. If the
namespace is not declared on the documentElement, the
namespace will be locally declared for the newly created
node. In the case of attributes this may look a bit confusing,
since these nodes cannot have namespace declarations
themselves. In this case the namespace is internally applied
to the attribute and later declared on the node the
attribute is appended to (if required).
The following example may explain this a bit:
my $doc = XML::LibXML->createDocument;
my $root = $doc->createElementNS( "", "foo" );
$doc->setDocumentElement( $root );
my $attr = $doc->createAttributeNS( "bar", "bar:foo", "test" );
$root->setAttributeNodeNS( $attr );
This piece of code will result in the following document:
<?xml version="1.0"?>
<foo xmlns:bar="bar" bar:foo="test"/>
The namespace is declared on the document element
during the setAttributeNodeNS() call.
Namespaces can be also declared explicitly by the use of XML::LibXML::Element's setNamespace() function.
Since 1.61, they can also be manipulated with functions
setNamespaceDeclPrefix() and setNamespaceDeclURI() (not available in DOM).
Changing the URI or prefix of an existing namespace declaration
affects the namespace URI and prefix of all nodes which point to it
(that is the nodes in its scope).
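The effect of changing a declaration can be seen in a small sketch. The prefix x and the URIs urn:old and urn:new are placeholders chosen for this example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $ns_doc = XML::LibXML->load_xml(
    string => '<x:root xmlns:x="urn:old"><x:child/></x:root>' );
my $ns_root  = $ns_doc->documentElement;
my $ns_child = $ns_root->firstChild;

print "before: ", $ns_child->namespaceURI, "\n";

# Change the URI of the declaration bound to prefix "x" on the root element.
$ns_root->setNamespaceDeclURI( 'x', 'urn:new' );

# The child points to the same declaration, so its namespace URI follows.
print "after:  ", $ns_child->namespaceURI, "\n";
print $ns_doc->toString;
```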
It is also important to repeat the specification:
While working with namespaces you should use the namespace
aware functions instead of the simplified versions. For
example you should never use
setAttribute() but setAttributeNS().
XML::LibXML DOM Document Class
XML::LibXML::Document
Synopsis
use XML::LibXML;
# Only methods specific to Document nodes are listed here,
# see the XML::LibXML::Node manpage for other methods
Description
The Document Class is in most cases the result of a parsing process. But sometimes it is necessary to create a Document from scratch. The DOM
Document Class provides functions that conform to the DOM Core naming style.
It inherits all functions from XML::LibXML::Node as specified in the DOM specification. This enables access to the nodes
besides the root element on document level - a DTD for example. The support for these nodes is limited at the moment.
While generally nodes are bound to a document in the DOM concept it is suggested that one should always create a node not bound to any document.
There is no need to actually add the node to the document, but once the node is bound to a document, it is reasonably safe to assume that all strings have the
correct encoding. If an unbound text node with an ISO encoded string is created (e.g. with $CLASS->new()), the toString function
may not return the expected result.
To prevent such problems, it is recommended to pass all data to XML::LibXML methods
as character strings (i.e. UTF-8 encoded, with the UTF8 flag on).
Methods
Many functions listed here are
extensively documented in the DOM Level 3 specification. Please refer to
the specification for extensive documentation.
new
$dom = XML::LibXML::Document->new( $version, $encoding );
alias for createDocument()
createDocument
$dom = XML::LibXML::Document->createDocument( $version, $encoding );
The constructor for the document class. As parameters it takes the version string and (optionally) the encoding string. Simply calling
createDocument() will create the document:
<?xml version="your version" encoding="your encoding"?>
Both parameters are optional. The default value for $version is 1.0, of course. If the
$encoding parameter is not set, the encoding will be left unset, which means UTF-8 is implied.
Calling createDocument() without any parameter will result in the following:
<?xml version="1.0"?>
Alternatively one can call this constructor directly from the XML::LibXML class level, to avoid some typing. This will not have any
effect on the class instance, which is always XML::LibXML::Document.
my $document = XML::LibXML->createDocument( "1.0", "UTF-8" );
is therefore a shortcut for
my $document = XML::LibXML::Document->createDocument( "1.0", "UTF-8" );
URI
$strURI = $doc->URI();
Returns the URI (or filename) of the original document.
For documents obtained by parsing a string or a filehandle
without using the URI parsing argument of the corresponding parse_* function,
the result is a generated string unknown-XYZ where XYZ is some number;
for documents created with the constructor new,
the URI is undefined.
The value can be modified by calling setURI
method on the document node.
setURI
$doc->setURI($strURI);
Sets the URI of the document reported by the method URI
(see also the URI argument to the various parse_* functions).
encoding
$strEncoding = $doc->encoding();
returns the encoding string of the document.
my $doc = XML::LibXML->createDocument( "1.0", "ISO-8859-15" );
print $doc->encoding; # prints ISO-8859-15
actualEncoding
$strEncoding = $doc->actualEncoding();
returns the encoding in which the XML will be returned by $doc->toString().
This is usually the original encoding of the document as declared
in the XML declaration and returned by $doc->encoding.
If the original encoding is not known (e.g. if created in memory or parsed from a
XML without a declared encoding), 'UTF-8' is returned.
my $doc = XML::LibXML->createDocument( "1.0", "ISO-8859-15" );
print $doc->actualEncoding; # prints ISO-8859-15
setEncoding
$doc->setEncoding($new_encoding);
This method allows one to change the declaration of
encoding in the XML declaration of the document.
The value also affects the encoding in which the
document is serialized to XML by $doc->toString().
Call setEncoding() without an argument to remove the encoding declaration.
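The interplay of encoding(), actualEncoding() and setEncoding() can be sketched as follows. The element name and text are invented for this example, and the text is explicitly upgraded to a character string, as this documentation recommends for all data passed to XML::LibXML.

```perl
use strict;
use warnings;
use XML::LibXML;

my $enc_doc  = XML::LibXML->createDocument( '1.0', 'UTF-8' );
my $enc_root = $enc_doc->createElement('name');
$enc_doc->setDocumentElement($enc_root);

# Pass text as a character string (UTF8 flag on).
my $name = "Ren\x{e9}e";
utf8::upgrade($name);
$enc_root->appendText($name);

# Redeclare the encoding; serialization follows the new declaration.
$enc_doc->setEncoding('ISO-8859-1');
print "declared: ", $enc_doc->encoding,       "\n";
print "actual:   ", $enc_doc->actualEncoding, "\n";

# toString() on a document returns bytes in the declared encoding.
my $latin1_bytes = $enc_doc->toString;
printf "serialized length: %d bytes\n", length $latin1_bytes;
```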
version
$strVersion = $doc->version();
returns the version string of the document
getVersion() is an alternative form of this function.
standalone
$doc->standalone
This function returns the numerical value of the standalone attribute in a document's XML declaration. It returns 1 if
standalone="yes" was found, 0 if standalone="no" was found and -1 if standalone
was not specified (default on creation).
setStandalone
$doc->setStandalone($numvalue);
Through this method it is possible to alter the value of a document's standalone attribute. Set it to 1 to set
standalone="yes", to 0 to set standalone="no" or set it to -1 to remove the
standalone attribute from the XML declaration.
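A brief sketch of the two methods together; the element name config is a placeholder.

```perl
use strict;
use warnings;
use XML::LibXML;

my $sa_doc = XML::LibXML->createDocument( '1.0', 'UTF-8' );
$sa_doc->setDocumentElement( $sa_doc->createElement('config') );

# A freshly created document has no standalone attribute yet.
print "initial value: ", $sa_doc->standalone, "\n";

# 1 requests standalone="yes" in the XML declaration.
$sa_doc->setStandalone(1);
print "new value:     ", $sa_doc->standalone, "\n";
print $sa_doc->toString;
```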
compression
my $compression = $doc->compression;
libxml2 allows reading of documents directly from gzipped files. In this case the compression variable is set to the compression level
of that file (0-8). If XML::LibXML parsed a different source or the file wasn't compressed, the returned value will be
-1.
setCompression
$doc->setCompression($ziplevel);
If one intends to write the document directly to a file, it is possible to set the compression level for a given document. This level
can be in the range from 0 to 8. If XML::LibXML should not try to compress use -1 (default).
Note that this feature will only work if libxml2 is compiled with zlib support and toFile() is used for output.
toString
$docstring = $dom->toString($format);
toString is a DOM serializing function,
so the DOM Tree is serialized into an XML string, ready for output.
IMPORTANT: unlike toString for other nodes, on document nodes
this function returns the XML as a byte string in the original encoding of the
document (see the actualEncoding() method)! This means you
can simply do:
open my $out_fh, '>', $file or die "Cannot open $file: $!";
print {$out_fh} $doc->toString;
regardless of the actual encoding of the document.
See the section on encodings in XML::LibXML for more details.
The optional $format parameter sets the indenting of the output. This parameter is expected to be an
integer value that specifies whether indentation should be used. The format parameter can have three different values if
it is used:
If $format is 0, then the document is dumped as it was originally parsed.
If $format is 1, libxml2 will add ignorable white space, so the nodes' content is easier to read. Existing text nodes will not be
altered.
If $format is 2 (or higher), libxml2 will act as for $format == 1 but will also add a leading and a trailing line break to each text node.
libxml2 uses a hard-coded indentation of 2 space characters per indentation level. This value can not be altered on run-time.
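The difference between $format values 0 and 1 is easiest to see side by side. In this sketch the input is parsed without any white space between elements, so all indentation in the second output comes from libxml2.

```perl
use strict;
use warnings;
use XML::LibXML;

my $fmt_doc = XML::LibXML->load_xml(
    string => '<list><item>a</item><item>b</item></list>' );

# $format = 0: serialized as parsed, on a single line.
my $as_parsed = $fmt_doc->toString(0);

# $format = 1: libxml2 adds line breaks and two spaces per level.
my $indented = $fmt_doc->toString(1);

print "--- format 0 ---\n$as_parsed";
print "--- format 1 ---\n$indented";
```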
toStringC14N
$c14nstr = $doc->toStringC14N($comment_flag, $xpath [, $xpath_context ]);
See the documentation in XML::LibXML::Node.
toStringEC14N
$ec14nstr = $doc->toStringEC14N($comment_flag, $xpath [, $xpath_context ], $inclusive_prefix_list);
See the documentation in XML::LibXML::Node.
serialize
$str = $doc->serialize($format);
An alias for toString(). This function name was added to be more consistent
with libxml2.
serialize_c14n
An alias for toStringC14N().
serialize_exc_c14n
An alias for toStringEC14N().
toFile
$state = $doc->toFile($filename, $format);
This function is similar to toString(), but it writes the document directly to a file. This function is very useful if one
needs to store large documents.
The format parameter has the same behaviour as in toString().
toFH
$state = $doc->toFH($fh, $format);
This function is similar to toString(), but it writes the document directly to a filehandle or a stream. A byte stream in the document encoding is passed to the file handle. Do NOT apply any :encoding(...) or :utf8 PerlIO layer to
the filehandle! See the section on encodings in XML::LibXML for more details.
The format parameter has the same behaviour as in toString().
toStringHTML
$str = $document->toStringHTML();
toStringHTML serializes the tree to a byte string in the document encoding as HTML. With this method indenting is automatic and managed by
libxml2 internally.
serialize_html
$str = $document->serialize_html();
An alias for toStringHTML().
is_valid
$bool = $dom->is_valid();
Returns either TRUE or FALSE depending on whether the DOM Tree is a valid Document or not.
You may also pass in an XML::LibXML::Dtd object, to validate against an external DTD:
if (!$dom->is_valid($dtd)) {
warn("document is not valid!");
}
validate
$dom->validate();
This is an exception-throwing equivalent of is_valid. If the document is not valid it will throw an exception containing the error.
This allows much better error reporting than simply checking is_valid.
Again, you may pass in a DTD object.
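Both methods can be compared in a short sketch that validates against a DTD held in a string. The DTD and the two sample documents are made up for this example; XML::LibXML::Dtd->parse_string() is used to build the DTD object.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical DTD: a note must contain "to" followed by "body".
my $note_dtd = XML::LibXML::Dtd->parse_string(<<'DTD');
<!ELEMENT note (to, body)>
<!ELEMENT to   (#PCDATA)>
<!ELEMENT body (#PCDATA)>
DTD

my $good_note = XML::LibXML->load_xml(
    string => '<note><to>Ann</to><body>Hi</body></note>' );
my $bad_note = XML::LibXML->load_xml(
    string => '<note><body>Hi</body></note>' );

# is_valid() only answers yes or no.
print "good note valid? ", ( $good_note->is_valid($note_dtd) ? "yes" : "no" ), "\n";
print "bad note valid?  ", ( $bad_note->is_valid($note_dtd)  ? "yes" : "no" ), "\n";

# validate() throws, so the reason is available in $@.
my $validated = eval { $bad_note->validate($note_dtd); 1 };
my $validation_error = $@;
print "validation failed: $validation_error" unless $validated;
```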
documentElement
$root = $dom->documentElement();
Returns the root element of the Document. A document can have just one root element to contain the document's data.
Optionally one can use getDocumentElement.
setDocumentElement
$dom->setDocumentElement( $root );
This function enables you to set the root element for a document. The function supports the import of a node from a different document
tree, but does not support a document fragment as $root.
createElement
$element = $dom->createElement( $nodename );
This function creates a new Element Node bound to the DOM with the name $nodename.
createElementNS
$element = $dom->createElementNS( $namespaceURI, $nodename );
This function creates a new Element Node bound to the DOM with the name $nodename and placed in the given
namespace.
createTextNode
$text = $dom->createTextNode( $content_text );
Equivalent to createElement, but creates a Text Node bound to the DOM.
createComment
$comment = $dom->createComment( $comment_text );
Equivalent to createElement, but creates a Comment Node bound to the DOM.
createAttribute
$attrnode = $doc->createAttribute($name [,$value]);
Creates a new Attribute node.
createAttributeNS
$attrnode = $doc->createAttributeNS( $namespaceURI, $name [,$value] );
Creates an Attribute bound to a namespace.
createDocumentFragment
$fragment = $doc->createDocumentFragment();
This function creates a DocumentFragment.
createCDATASection
$cdata = $dom->createCDATASection( $cdata_content );
Similar to createTextNode and createComment, this function creates a CDataSection bound to the current DOM.
createProcessingInstruction
my $pi = $doc->createProcessingInstruction( $target, $data );
Creates a processing instruction node.
Since this method is quite long one may use its short form createPI().
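The create* methods above are typically combined with setDocumentElement() and appendChild() to build a document in memory. The following is a small sketch under that assumption; the element names and content are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
my $root = $doc->createElement('notes');
$doc->setDocumentElement($root);

# Each node is created by the document, then attached explicitly.
$root->appendChild( $doc->createComment(' generated example ') );

my $note = $doc->createElement('note');
$note->appendChild( $doc->createTextNode('Buy milk & eggs') );
$root->appendChild($note);

$root->appendChild( $doc->createCDATASection('<raw/>') );

# Serialize the whole document with indentation requested.
print $doc->toString(1);
```

Note that the text node content is passed unescaped; the ampersand is escaped on serialization.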
createEntityReference
my $entref = $doc->createEntityReference($refname);
If a document has a DTD specified, one can create entity references by using this function. If one wants to add an entity reference to
the document, this reference has to be created by this function.
An entity reference is unique to a document and cannot be passed to other documents as other nodes can be passed.
NOTE: Text content containing something that looks like an entity reference will not be expanded to a real
entity reference unless it is a predefined entity:
my $string = "&foo;";
$some_element->appendText( $string );
print $some_element->textContent; # prints "&foo;"
createInternalSubset
$dtd = $document->createInternalSubset( $rootnode, $public, $system);
This function creates and adds an internal subset to the given document. Because the function automatically adds the DTD to the document
there is no need to add the created node explicitly to the document.
my $document = XML::LibXML::Document->new();
my $dtd = $document->createInternalSubset( "foo", undef, "foo.dtd" );
will result in the following XML document:
<?xml version="1.0"?>
<!DOCTYPE foo SYSTEM "foo.dtd">
By setting the public parameter it is possible to set PUBLIC DTDs to a given document. So
my $document = XML::LibXML::Document->new();
my $dtd = $document->createInternalSubset( "foo", "-//FOO//DTD FOO 0.1//EN", undef );
will cause the following declaration to be created on the document:
<?xml version="1.0"?>
<!DOCTYPE foo PUBLIC "-//FOO//DTD FOO 0.1//EN">
createExternalSubset
$dtd = $document->createExternalSubset( $rootnode_name, $publicId, $systemId);
This function is similar to createInternalSubset() but this DTD is considered to be external and is therefore not
added to the document itself. Nevertheless it can be used for validation purposes.
importNode
$document->importNode( $node );
If a node is not part of a document, it can be imported to another document. As specified in DOM Level 2 Specification the Node will
not be altered or removed from its original document ($node->cloneNode(1) will get called implicitly).
NOTE: Don't try to use importNode() to import sub-trees that contain an entity reference - even if the entity
reference is the root node of the sub-tree. This will cause serious problems to your program. This is a limitation of libxml2 and not of
XML::LibXML itself.
adoptNode
$document->adoptNode( $node );
If a node is not part of a document, it can be imported to another document. As specified in the DOM Level 3 Specification, the node will
not be altered, but it will be removed from its original document.
After a document adopted a node, the node, its attributes and all its descendants belong to the new document. Because the node does
not belong to the old document, it will be unlinked from its old location first.
NOTE: Don't try to use adoptNode() to import sub-trees that contain entity references - even if the entity
reference is the root node of the sub-tree. This will cause serious problems to your program. This is a limitation of libxml2 and not of
XML::LibXML itself.
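The difference between importNode() and adoptNode() can be sketched as follows. The two tiny documents are invented for illustration; the example only relies on the copy-versus-move behaviour described above.

```perl
use strict;
use warnings;
use XML::LibXML;

my $src = XML::LibXML->load_xml( string => '<src><keep/><move/></src>' );
my $dst = XML::LibXML->load_xml( string => '<dst/>' );

my ($keep) = $src->findnodes('/src/keep');
my ($move) = $src->findnodes('/src/move');

# importNode() copies: the original node stays in $src.
my $copy = $dst->importNode($keep);
$dst->documentElement->appendChild($copy);

# adoptNode() moves: the node is unlinked from $src first.
$dst->adoptNode($move);
$dst->documentElement->appendChild($move);

my $src_xml = $src->documentElement->toString;
my $dst_xml = $dst->documentElement->toString;
print "$src_xml\n$dst_xml\n";
```

In both cases the node still has to be inserted into the target tree explicitly; importing or adopting alone does not place it anywhere.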
externalSubset
my $dtd = $doc->externalSubset;
If a document has an external subset defined it will be returned by this function.
NOTE: Dtd nodes are not ordinary nodes in libxml2. The support for these nodes in XML::LibXML is still limited. In
particular, one may not want to use common node functions on doctype declaration nodes!
internalSubset
my $dtd = $doc->internalSubset;
If a document has an internal subset defined it will be returned by this function.
NOTE: Dtd nodes are not ordinary nodes in libxml2. The support for these nodes in XML::LibXML is still limited. In
particular, one may not want to use common node functions on doctype declaration nodes!
setExternalSubset
$doc->setExternalSubset($dtd);
EXPERIMENTAL!
This method sets a DTD node as an external subset of the given document.
setInternalSubset
$doc->setInternalSubset($dtd);
EXPERIMENTAL!
This method sets a DTD node as an internal subset of the given document.
removeExternalSubset
my $dtd = $doc->removeExternalSubset();
EXPERIMENTAL!
If a document has an external subset defined it can be removed from the document by using this function. The removed dtd node will be
returned.
removeInternalSubset
my $dtd = $doc->removeInternalSubset();
EXPERIMENTAL!
If a document has an internal subset defined it can be removed from the document by using this function. The removed dtd node will be
returned.
getElementsByTagName
my @nodelist = $doc->getElementsByTagName($tagname);
Implements the DOM Level 2 function.
In SCALAR context this function returns an XML::LibXML::NodeList object.
getElementsByTagNameNS
my @nodelist = $doc->getElementsByTagNameNS($nsURI,$tagname);
Implements the DOM Level 2 function.
In SCALAR context this function returns an XML::LibXML::NodeList object.
getElementsByLocalName
my @nodelist = $doc->getElementsByLocalName($localname);
This allows the fetching of all nodes from a given document with the given Localname.
In SCALAR context this function returns an XML::LibXML::NodeList object.
getElementById
my $node = $doc->getElementById($id);
Returns the element that has an ID attribute
with the given value. If no such element exists,
this returns undef.
Note: the ID of an element
may change while manipulating the document.
For documents with a DTD, the information about ID attributes
is only available if DTD loading/validation has been requested.
For HTML documents parsed with the HTML
parser ID detection is done
automatically. In XML documents, all "xml:id"
attributes are considered to be of type ID.
You can test ID-ness of an attribute node
with $attr->isId().
In versions 1.59 and earlier this method was
called getElementsById() (plural) by
mistake. Starting from 1.60 this name is
maintained as an alias only for backward compatibility.
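As a short sketch of the xml:id case mentioned above, the following looks up elements by ID in a document without a DTD. The document content is invented for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string =>
    '<list><item xml:id="first">one</item><item xml:id="second">two</item></list>' );

# xml:id attributes are treated as IDs without any DTD.
my $hit  = $doc->getElementById('second');
my $miss = $doc->getElementById('third');    # no such ID: undef

print "second: ", ( defined $hit  ? $hit->textContent  : '(none)' ), "\n";
print "third:  ", ( defined $miss ? $miss->textContent : '(none)' ), "\n";
```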
indexElements
$dom->indexElements();
This function causes libxml2 to stamp all elements in a document with their document position index which considerably speeds up XPath
queries for large documents. It should only be used with static documents that won't be further changed by any DOM methods, because once
a document is indexed, XPath will always prefer the index to other methods of determining the document order of nodes. XPath could therefore
return improperly ordered node-lists when applied on a document that has been changed after being indexed. It is of course possible to use
this method to re-index a modified document before using it with XPath again. This function is not a part of the DOM specification.
This function returns the number of elements indexed, -1 if an error occurred, or -2 if this feature is not available in the running libxml2.
Abstract Base Class of XML::LibXML Nodes
XML::LibXML::Node
Synopsis
use XML::LibXML;
Description
XML::LibXML::Node defines functions that are common to
all Node Types. An XML::LibXML::Node should never be created
standalone, but as an instance of a high level class such as
XML::LibXML::Element or XML::LibXML::Text. The class itself should
provide only common functionality. In XML::LibXML each node is
part either of a document or a document-fragment. Because of
this there is no node without a parent. This may cause
confusion with "unbound" nodes.
Methods
Many functions listed here are
extensively documented in the DOM Level 3 specification. Please refer to
the specification for extensive documentation.
nodeName
$name = $node->nodeName;
Returns the node's name. This function is
aware of namespaces and returns the full name of
the current node (prefix:localname).
Since 1.62 this function also returns the correct
DOM names for node types with constant names, namely:
#text, #cdata-section, #comment, #document,
#document-fragment.
setNodeName
$node->setNodeName( $newName );
In very limited situations, it is useful to change a node's name. In the DOM specification this should throw an error. This function is
aware of namespaces.
isSameNode
$bool = $node->isSameNode( $other_node );
Returns TRUE (1) if the given nodes refer to
the same node structure, otherwise FALSE (0) is
returned.
isEqual
$bool = $node->isEqual( $other_node );
Deprecated version of isSameNode().
NOTE: isEqual will change behaviour to follow the DOM specification.
unique_key
$num = $node->unique_key;
This function is not specified for any DOM level. It returns a key guaranteed to be unique for this node, and to always be the same value for this node. In other words, two node objects return the same key if and only if isSameNode indicates that they are the same node.
The returned key value is useful as a key in hashes.
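One use of unique_key is removing duplicates when the same node is reached through several queries. The following sketch assumes a small invented document and combines two overlapping result lists.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string => '<r><a/><b/><a/></r>' );

# Two queries whose results overlap: 2 + 3 node objects, 3 distinct nodes.
my @hits = ( $doc->findnodes('//a'), $doc->findnodes('/r/*') );

# Keep the first object seen for every underlying node.
my %seen;
my @unique = grep { !$seen{ $_->unique_key }++ } @hits;

print scalar(@hits), " hits, ", scalar(@unique), " distinct nodes\n";
```

Comparing the Perl objects themselves would not work here, since separate queries may return different objects for the same underlying node.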
nodeValue
$content = $node->nodeValue;
If the node has any content (such as stored in a text node) it can get requested through this function.
NOTE: Element Nodes have no content per definition. To get the text value of an Element use textContent()
instead!
textContent
$content = $node->textContent;
This function returns the content of all text nodes in the descendants of the given node as specified in DOM.
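The distinction between nodeValue and textContent can be illustrated with a sketch like the following, using an invented paragraph with mixed content.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string => '<p>Hello <b>world</b>!</p>' );
my $p   = $doc->documentElement;

# The first child of <p> is a text node; nodeValue gives its content.
my $first_text = $p->firstChild;
my $value      = $first_text->nodeValue;

# textContent concatenates all descendant text of the element.
my $text = $p->textContent;

print "text node value: '$value'\n";
print "element text:    '$text'\n";
```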
nodeType
$type = $node->nodeType;
Return a numeric value representing the node type of this node.
The module XML::LibXML by default exports constants
for the node types (see the EXPORT section in the
manual page).
unbindNode
$node->unbindNode();
Unbinds the Node from its siblings and Parent, but not from the Document it belongs to. If the node is not inserted into the DOM
afterwards, it will be lost after the program terminates. From a low level view, the unbound node is stripped from the context it is in and
inserted into a (hidden) document-fragment.
removeChild
$childnode = $node->removeChild( $childnode );
This will unbind the Child Node from its parent $node. The function returns the unbound node. If
$childnode is not a child of the given Node the function will fail.
replaceChild
$oldnode = $node->replaceChild( $newNode, $oldNode );
Replaces the $oldNode with the $newNode. The $oldNode will be unbound
from the Node. This function differs from the DOM L2 specification in that, if the new node is not part of the document, it will
be imported first.
replaceNode
$node->replaceNode($newNode);
This function is very similar to replaceChild(), but it replaces the node itself rather than a child node. This is useful if a node
found by an XPath function should be replaced.
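The three removal and replacement methods can be compared in one short sketch. The element names are placeholders chosen for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc  = XML::LibXML->load_xml( string => '<list><a/><b/><c/></list>' );
my $root = $doc->documentElement;

# removeChild(): called on the parent, returns the unbound child.
my ($first) = $root->getChildrenByTagName('a');
my $removed = $root->removeChild($first);

# replaceChild(): called on the parent with the new and the old node.
my ($second) = $root->getChildrenByTagName('b');
$root->replaceChild( $doc->createElement('x'), $second );

# replaceNode(): called on the node itself, handy after an XPath query.
my ($third) = $doc->findnodes('//c');
$third->replaceNode( $doc->createElement('y') );

my $result = $root->toString;
print "$result\n";
```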
appendChild
$childnode = $node->appendChild( $childnode );
The function will add the $childnode to the end of $node's children. The function should
fail if the new childnode is already a child of $node. This function differs from the DOM L2 specification in that,
if the new node is not part of the document, it will be imported first.
addChild
$childnode = $node->addChild( $childnode );
As an alternative to appendChild() one can use the addChild() function. This function is a bit faster, because it avoids all DOM
conformity checks. Therefore this function is quite useful if one builds XML documents in memory where the order and ownership (ownerDocument)
is assured.
addChild() uses libxml2's own xmlAddChild() function. Thus it has to be used with extra care: If a text node is added to a node
and the node itself or its last childnode is also a text node, the node to add will be merged with the one already available. The current
node will be removed from memory after this action. Because perl is not aware of this action, the perl instance is still available.
XML::LibXML will catch the loss of a node and refuse to run any function called on that node.
my $t1 = $doc->createTextNode( "foo" );
my $t2 = $doc->createTextNode( "bar" );
$t1->addChild( $t2 ); # is OK
my $val = $t2->nodeValue(); # will fail, script dies
Also addChild() will not check if the added node belongs to the same document as the node it will be added to. This could lead to
inconsistent documents and, in worse cases, even to memory violations, if one does not keep track of this issue.
Although this sounds like a lot of trouble, addChild() is useful if a document is built from a stream, such as happens sometimes in
SAX handlers or filters.
If you are not sure about the source of your nodes, you had better stay with appendChild(), because this function is more user friendly in
the sense of being more error tolerant.
addNewChild
$node = $parent->addNewChild( $nsURI, $name );
Similar to addChild(), this function uses low level libxml2 functionality to provide a faster interface for DOM
building. addNewChild() uses xmlNewChild() to create a new node on a given parent element.
addNewChild() has two parameters $nsURI and $name, where $nsURI is an (optional) namespace URI. $name is the fully qualified element
name; addNewChild() will determine the correct prefix if necessary.
The function returns the newly created node.
This function is very useful for DOM building, where a created node can be directly associated with its parent. NOTE
this function is not part of the DOM specification and its use will limit your code to XML::LibXML.
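A minimal sketch of addNewChild() for an element without a namespace follows. Passing undef as $nsURI is assumed to mean "no namespace", in line with the description of the parameter as optional; the element names are invented.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
my $root = $doc->createElement('root');
$doc->setDocumentElement($root);

# Create the child directly on its parent; undef means no namespace here.
my $child = $root->addNewChild( undef, 'child' );
$child->appendText('text');

my $result = $root->toString;
print "$result\n";
```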
addSibling
$node->addSibling($newNode);
addSibling() allows adding an additional node to the end of a nodelist, defined by the given node.
cloneNode
$newnode = $node->cloneNode( $deep );
cloneNode creates a
copy of $node. When $deep is
set to 1 (true) the function will copy all
child nodes as well. If $deep is 0 only the current
node will be copied. Note that in the case of an element,
attributes are copied even if $deep is 0.
Note that the behavior of this function for $deep=0
has changed in 1.62 in order to be consistent with the DOM spec
(in older versions attributes and namespace information
were not copied for elements).
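The effect of the $deep flag can be seen in a sketch like this one, which assumes version 1.62 or later and uses an invented element with one attribute and one child.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc  = XML::LibXML->load_xml( string => '<a id="1"><b/></a>' );
my $orig = $doc->documentElement;

# Shallow copy: the element and its attributes, but no children.
my $shallow = $orig->cloneNode(0)->toString;

# Deep copy: the element with its whole subtree.
my $deep = $orig->cloneNode(1)->toString;

print "shallow: $shallow\n";
print "deep:    $deep\n";
```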
parentNode
$parentnode = $node->parentNode;
Simply returns the parent node of the current node.
nextSibling
$nextnode = $node->nextSibling();
Returns the next sibling, if any.
nextNonBlankSibling
$nextnode = $node->nextNonBlankSibling();
Returns the next non-blank sibling if any (a node is blank if it is a Text or CDATA node consisting of whitespace only). This method is not defined by DOM.
previousSibling
$prevnode = $node->previousSibling();
Analogous to getNextSibling, this function returns the previous sibling, if any.
previousNonBlankSibling
$prevnode = $node->previousNonBlankSibling();
Returns the previous non-blank sibling if any (a node is blank if it is a Text or CDATA node consisting of whitespace only). This method is not defined by DOM.
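In indented documents the plain sibling methods usually return whitespace text nodes. The following sketch, based on an invented three-line document parsed with default options, shows the difference.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string => "<r>\n  <a/>\n  <b/>\n</r>" );
my ($first) = $doc->findnodes('/r/a');

# The direct sibling is the whitespace between <a/> and <b/>.
my $next = $first->nextSibling;

# Skipping blank nodes leads straight to <b/>, and back again to <a/>.
my $next_elem = $first->nextNonBlankSibling;
my $prev_elem = $next_elem->previousNonBlankSibling;

print "nextSibling:         ", $next->nodeName,      "\n";
print "nextNonBlankSibling: ", $next_elem->nodeName, "\n";
```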
hasChildNodes
$boolean = $node->hasChildNodes();
If the current node has child nodes this function returns TRUE (1), otherwise it returns FALSE (0, not undef).
firstChild
$childnode = $node->firstChild;
If a node has child nodes this function will return the first node in the child list.
lastChild
$childnode = $node->lastChild;
If the $node has child nodes this function returns the last child node.
ownerDocument
$documentnode = $node->ownerDocument;
Through this function it is always possible to access the document the current node is bound to.
getOwner
$node = $node->getOwner;
This function returns the node the current node is associated with. In most cases this will be a document node or a document fragment
node.
setOwnerDocument
$node->setOwnerDocument( $doc );
This function binds a node to another DOM. This method unbinds the node first, if it is already bound to another document.
This function is the opposite calling of XML::LibXML::Document's adoptNode() function. Because of this it has the same limitations
with Entity References as adoptNode().
insertBefore
$node->insertBefore( $newNode, $refNode );
The method inserts $newNode before $refNode. If $refNode is undefined,
the newNode will be set as the new last child of the parent node. This function differs from the DOM L2 specification in that, if the
new node is not part of the document, it will be imported first, automatically.
$refNode has to be passed to the function even if it is undefined:
$node->insertBefore( $newNode, undef ); # the same as $node->appendChild( $newNode );
$node->insertBefore( $newNode ); # wrong
Note that the reference node has to be a direct child of the node the function is called on. Also, $newNode is not allowed to be an
ancestor of the new parent node.
insertAfter
$node->insertAfter( $newNode, $refNode );
The method inserts $newNode after $refNode. If $refNode is undefined,
the newNode will be set as the new last child of the parent node.
Note, that $refNode has to be passed explicitly even if it is undef.
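A brief sketch of both insertion methods, using a reference node that is a direct child of the parent; the element names are placeholders.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc    = XML::LibXML->load_xml( string => '<r><b/></r>' );
my $parent = $doc->documentElement;
my $ref    = $parent->firstChild;    # <b/>

# Both methods are called on the parent of the reference node.
$parent->insertBefore( $doc->createElement('a'), $ref );
$parent->insertAfter( $doc->createElement('c'), $ref );

my $result = $parent->toString;
print "$result\n";
```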
findnodes
@nodes = $node->findnodes( $xpath_expression );
findnodes evaluates the xpath expression (XPath 1.0) on the current node and returns the resulting node set as an array. In scalar context, returns an XML::LibXML::NodeList object.
The xpath expression can be passed either as a string, or
as a XML::LibXML::XPathExpression object.
NOTE ON NAMESPACES AND XPATH:
A common mistake about
XPath is to assume that node tests consisting of an
element name with no prefix match elements in the default
namespace. This assumption is wrong - by XPath
specification, such node tests can only match elements
that are in no (i.e. null) namespace.
So, for example, one cannot match the root element of an
XHTML document with $node->find('/html')
since '/html' would only match if the
root element <html> had no
namespace, but all XHTML elements belong to the namespace
http://www.w3.org/1999/xhtml. (Note that
xmlns="..." namespace declarations can
also be specified in a DTD, which makes the situation even worse, since
the XML document looks as if there was no default namespace).
There are several possible ways to deal with namespaces in XPath:
The recommended way is to use the
XML::LibXML::XPathContext module
to define an explicit context
for XPath evaluation, in which a document independent
prefix-to-namespace mapping can be defined. For
example:
my $xpc = XML::LibXML::XPathContext->new;
$xpc->registerNs('x', 'http://www.w3.org/1999/xhtml');
$xpc->find('/x:html',$node);
Another possibility is to use prefixes declared
in the queried document (if known).
If the document declares a prefix for the
namespace in question (and the context node is in the
scope of the declaration),
XML::LibXML allows you to use the
prefix in the XPath expression, e.g.:
$node->find('/x:html');
See also XML::LibXML::XPathContext->findnodes.
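The namespace pitfall described above can be reproduced with a self-contained sketch. The prefix x is an arbitrary choice made for the example; only the namespace URI has to match the document.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string =>
    '<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>' );

# An unprefixed name test only matches elements in no namespace.
my @plain = $doc->findnodes('/html');

# With an explicit context, a self-chosen prefix is bound to the URI.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( x => 'http://www.w3.org/1999/xhtml' );
my @qualified = $xpc->findnodes('/x:html');

print "without prefix: ", scalar(@plain),     " match(es)\n";
print "with prefix:    ", scalar(@qualified), " match(es)\n";
```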
find
$result = $node->find( $xpath );
find evaluates the XPath 1.0 expression using the current node as the context of the expression, and returns the
result depending on what type of result the XPath expression had. For example, the XPath "1 * 3 + 52" results in a
XML::LibXML::Number object being returned. Other expressions might return an XML::LibXML::Boolean
object, or an XML::LibXML::Literal object (a string). Each of those objects uses Perl's overload feature to "do
the right thing" in different contexts.
The xpath expression can be passed either as a string,
or as a XML::LibXML::XPathExpression object.
See also XML::LibXML::XPathContext->find.
findvalue
print $node->findvalue( $xpath );
findvalue is exactly equivalent to:
$node->find( $xpath )->to_literal;
That is, it returns the literal value of the results. This enables you to ensure that you get a string back from your search, allowing
certain shortcuts. This could be used as the equivalent of XSLT's <xsl:value-of select="some_xpath"/>.
See also XML::LibXML::XPathContext->findvalue.
The xpath expression can be passed either as a string, or
as a XML::LibXML::XPathExpression object.
exists
$bool = $node->exists( $xpath_expression );
This method behaves like findnodes, except
that it only returns a boolean value (1 if the expression matches a node, 0 otherwise)
and may be faster than findnodes, because
the XPath evaluation may stop early on the first match (this is true for libxml2 >= 2.6.27).
For XPath expressions that do not return node-set,
the method returns true if the returned value is a non-zero number or a non-empty string.
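The following sketch contrasts find, findvalue and exists on a small invented order document. The attribute name price and the expressions are assumptions made for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string =>
    '<order><item price="2"/><item price="3"/></order>' );

# find() returns a typed result object, here an XML::LibXML::Number.
my $count = $doc->find('count(//item)');

# findvalue() returns the literal (string) value of the result.
my $total = $doc->findvalue('sum(//item/@price)');

# exists() only answers whether the expression matches.
my $has_cheap = $doc->exists('//item[@price < 3]');
my $has_free  = $doc->exists('//item[@price = 0]');

print "items: $count, total: $total\n";
print "cheap item present\n" if $has_cheap;
print "no free item\n" unless $has_free;
```

The Number object stringifies through overloading, which is why it can be interpolated directly.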
childNodes
@childnodes = $node->childNodes();
childNodes implements a more intuitive interface to the childnodes of the current node. It enables you to pass
all children directly to a map or grep. If this function is called in scalar context, a
XML::LibXML::NodeList object will be returned.
nonBlankChildNodes
@childnodes = $node->nonBlankChildNodes();
This is like childNodes,
but returns only non-blank nodes
(where a node is blank if it is a Text or CDATA node consisting of whitespace only). This method is not defined by DOM.
toString
$xmlstring = $node->toString($format,$docencoding);
This method is similar to the method toString of a XML::LibXML::Document, but for a single node. It returns a string consisting of the XML serialization of the given node and all its descendants. Unlike XML::LibXML::Document::toString, in this case the resulting string is by default a character string (UTF-8 encoded with UTF8 flag on). An optional flag $format controls indentation, as in XML::LibXML::Document::toString. If the second optional $docencoding flag is true, the result will be a byte string in the document encoding (see XML::LibXML::Document::actualEncoding).
toStringC14N
$c14nstring = $node->toStringC14N();
$c14nstring = $node->toStringC14N($with_comments, $xpath_expression , $xpath_context);
The function is similar to
toString(). Instead of simply serializing the
document tree, it transforms it as it is specified
in the XML-C14N Specification
(see http://www.w3.org/TR/xml-c14n).
Such transformation is known as
canonization.
If $with_comments is 0 or not defined, the
result-document will not contain any comments that
exist in the original document. To include
comments into the canonized document,
$with_comments has to be set to 1.
The parameter $xpath_expression defines the
nodeset of nodes that should be visible in the
resulting document. This can be used to filter out
some nodes. One has to note, that only the nodes
that are part of the nodeset, will be included
into the result-document. Their child-nodes will
not exist in the resulting document, unless they
are part of the nodeset defined by the xpath
expression.
If $xpath_expression is omitted or empty,
toStringC14N() will include all nodes in the given
sub-tree, using the following XPath expressions:
with comments
(. | .//node() | .//@* | .//namespace::*)
and without comments
(. | .//node() | .//@* | .//namespace::*)[not(self::comment())]
An optional parameter $xpath_context can be used
to pass an XML::LibXML::XPathContext object defining
the context for evaluation of $xpath_expression.
This is useful for mapping namespace prefixes used in the XPath expression
to namespace URIs.
Note, however, that
$node will be used as the context node for the evaluation, not
the context node of $xpath_context!
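As a small sketch of what canonicalization changes, the following compares an invented document with unordered attributes and an empty-element tag against its canonical form.

```perl
use strict;
use warnings;
use XML::LibXML;

# Attributes out of order, extra whitespace in the tag, empty-element tag.
my $doc = XML::LibXML->load_xml( string => '<a  c="2" b="1"><x/></a>' );

# Canonical form: attributes sorted, empty elements written as tag pairs.
my $c14n = $doc->toStringC14N();

print "$c14n\n";
```

Two documents that differ only in such lexical details yield the same canonical string, which makes the result suitable for comparison.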
toStringC14N_v1_1
$c14nstring = $node->toStringC14N_v1_1();
$c14nstring = $node->toStringC14N_v1_1($with_comments, $xpath_expression , $xpath_context);
This function behaves like toStringC14N() except that
it uses the "XML_C14N_1_1" constant for
canonicalising using the "C14N 1.1 spec".
toStringEC14N
$ec14nstring = $node->toStringEC14N();
$ec14nstring = $node->toStringEC14N($with_comments, $xpath_expression, $inclusive_prefix_list);
$ec14nstring = $node->toStringEC14N($with_comments, $xpath_expression, $xpath_context, $inclusive_prefix_list);
The function is similar to toStringC14N() but follows
the XML-EXC-C14N Specification (see http://www.w3.org/TR/xml-exc-c14n)
for exclusive canonization of XML.
The arguments $with_comments, $xpath_expression, $xpath_context are as in toStringC14N().
An ARRAY reference can be passed as the last argument $inclusive_prefix_list,
listing namespace prefixes that are to be handled in the manner described by the Canonical XML Recommendation (i.e. preserved in the output even if the namespace is not used). Cf. the spec for details.
serialize
$str = $doc->serialize($format);
An alias for toString(). This function name was added to be more consistent
with libxml2.
serialize_c14n
An alias for toStringC14N().
serialize_exc_c14n
An alias for toStringEC14N().
localname
$localname = $node->localname;
Returns the local name of a tag. This is the part behind the colon.
prefix
$nameprefix = $node->prefix;
Returns the prefix of a tag. This is the part before the colon.
namespaceURI
$uri = $node->namespaceURI();
Returns the URI of the current namespace.
hasAttributes
$boolean = $node->hasAttributes();
Returns 1 (TRUE) if the current node has any attributes set, otherwise 0 (FALSE) is returned.
attributes
@attributelist = $node->attributes();
This function returns all attributes and namespace declarations assigned to the given node.
Because XML::LibXML does not implement namespace declarations and attributes the same way, it is required to test what kind of node is
handled while accessing the function's result.
If this function is called in array context the attribute nodes are returned as an array. In scalar context, the function will return a
XML::LibXML::NamedNodeMap object.
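One way to perform the test mentioned above is to compare nodeType against the exported node type constants. This is a sketch on an invented element; it assumes XML_NAMESPACE_DECL is among the constants exported by default.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string =>
    '<r xmlns:x="urn:example" a="1" x:b="2"/>' );

# Separate real attributes from namespace declarations.
my ( @attrs, @decls );
for my $item ( $doc->documentElement->attributes ) {
    if ( $item->nodeType == XML_NAMESPACE_DECL ) {
        push @decls, $item;
    }
    else {
        push @attrs, $item;
    }
}

print "attributes: ", join( ', ', map { $_->nodeName } @attrs ), "\n";
print "namespace declarations: ", scalar(@decls), "\n";
```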
lookupNamespaceURI
$URI = $node->lookupNamespaceURI( $prefix );
Find a namespace URI by its prefix starting at the current node.
lookupNamespacePrefix
$prefix = $node->lookupNamespacePrefix( $URI );
Find a namespace prefix by its URI starting at the current node.
NOTE Only the namespace URIs are meant to be unique. The prefix is only document related. Also the document might
have more than a single prefix defined for a namespace.
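The name and namespace accessors relate to each other as shown in this sketch. The namespace URI and the prefix x are invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $ns  = 'http://example.org/ns';
my $doc = XML::LibXML->load_xml( string =>
    qq{<x:root xmlns:x="$ns"><x:child/></x:root>} );
my $child = $doc->documentElement->firstChild;

# Parts of the qualified name.
my $name   = $child->nodeName;     # prefix:localname
my $local  = $child->localname;
my $prefix = $child->prefix;
my $uri    = $child->namespaceURI;

# Lookups start at the node and consider declarations on ancestors.
my $found_uri    = $child->lookupNamespaceURI('x');
my $found_prefix = $child->lookupNamespacePrefix($ns);

print "$name = $prefix + $local in $uri\n";
```

The declaration lives on the root element, yet both lookups succeed on the child because they search the enclosing scope.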
normalize
$node->normalize;
This function normalizes adjacent text nodes. This function is not as strict as libxml2's xmlTextMerge() function, since it will
not free a node that is still referenced by the perl layer.
getNamespaces
@nslist = $node->getNamespaces;
If a node has any namespaces defined, this function will return these namespaces. Note, that this will not return all namespaces that
are in scope, but only the ones declared explicitly for that node.
Although getNamespaces is available for all nodes, it only makes sense if used with element nodes.
removeChildNodes
$node->removeChildNodes();
This function is not specified for any DOM level: It removes all childnodes from a node in a single step. Unlike the libxml2
function itself (xmlFreeNodeList), this function will not immediately remove the nodes from the memory. This saves one from getting memory
violations, if there are nodes still referred to from the Perl level.
baseURI ()
$strURI = $node->baseURI();
Searches for the base URL of the node. The method should work on both XML
and HTML documents even if base mechanisms for these are completely different.
It returns the base as defined in RFC 2396 sections
"5.1.1. Base URI within Document Content"
and
"5.1.2. Base URI from the Encapsulating Entity".
However it does not return the document base (5.1.3), use method URI
of XML::LibXML::Document for this.
setBaseURI ($strURI)
$node->setBaseURI($strURI);
This method only does something useful for an element node
in an XML document.
It sets the xml:base attribute on the node to $strURI, which
effectively sets the base URI of the node to the same value.
Note: For HTML documents this behaves as if the document was XML
which may not be desired, since it does not effectively
set the base URI of the node. See RFC 2396 appendix D
for an example of how base URI can be specified in HTML.
nodePath
$node->nodePath();
This function is not specified for any DOM level: It returns a canonical structure based XPath for a given node.
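A short sketch of nodePath() on an invented document with two same-named siblings; the returned path can be fed back into findnodes() to reach the same node.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string => '<r><a/><a/></r>' );
my $second = $doc->documentElement->lastChild;

# The path includes a position because <a> occurs twice.
my $path = $second->nodePath();

# Evaluating the path leads back to the same node.
my ($again) = $doc->findnodes($path);

print "$path\n";
```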
line_number
$lineno = $node->line_number();
This function returns the line number where the tag was found during parsing. If a node is added to the document the line number is 0.
Problems may occur, if a node from one document is passed to another one.
IMPORTANT:
Due to limitations in the libxml2 library line numbers greater than
65535 will be returned as 65535. Please see
http://bugzilla.gnome.org/show_bug.cgi?id=325533 for more details.
Note: line_number() is special to XML::LibXML and not part of the DOM specification.
If the line_numbers flag of the parser was not activated before parsing, line_number() will always return 0.
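The following sketch enables the line_numbers parser option before parsing. It assumes the option can be passed to the constructor; the four-line document is invented.

```perl
use strict;
use warnings;
use XML::LibXML;

# Line numbers are only recorded when requested before parsing.
my $parser = XML::LibXML->new( line_numbers => 1 );
my $doc    = $parser->parse_string("<r>\n<a/>\n<b/>\n</r>");

my ($elem) = $doc->findnodes('/r/b');
my $parsed_line = $elem->line_number();

# Nodes created through the DOM have no source line.
my $created_line = $doc->createElement('x')->line_number();

print "parsed node: line $parsed_line\n";
print "created node: line $created_line\n";
```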
XML::LibXML Class for Element Nodes
XML::LibXML::Element
Synopsis
use XML::LibXML;
# Only methods specific to Element nodes are listed here,
# see the XML::LibXML::Node manpage for other methods
Methods
The class inherits from XML::LibXML::Node.
The documentation for inherited methods is not listed here.
Many functions listed here are
extensively documented in the DOM Level 3 specification. Please refer to
the specification for extensive documentation.
new
$node = XML::LibXML::Element->new( $name );
This function creates a new node unbound to any DOM.
setAttribute
$node->setAttribute( $aname, $avalue );
This method sets or replaces the node's attribute $aname to the value $avalue.
setAttributeNS
$node->setAttributeNS( $nsURI, $aname, $avalue );
Namespace-aware version of setAttribute, where
$nsURI is a namespace URI,
$aname is a qualified name,
and $avalue is the value.
The namespace URI may be null (empty or undefined)
in order to create an attribute which has no namespace.
The current implementation differs from DOM in the following aspects:
If an attribute with the same local name and namespace URI already exists
on the element, but its prefix differs from the prefix of $aname,
then this function is supposed to change the prefix (regardless
of namespace declarations and possible collisions).
However, the current implementation does rather the opposite.
If a prefix is declared for the namespace URI in the scope
of the attribute, then the already declared prefix is used,
disregarding the prefix specified in $aname.
If no prefix is declared for the namespace, the function tries
to declare the prefix specified in $aname
and dies if the prefix is already taken by some other namespace.
According to DOM Level 2 specification, this method can also be used to
create or modify special attributes used for declaring XML namespaces
(which belong to the namespace "http://www.w3.org/2000/xmlns/" and
have prefix or name "xmlns"). This should work since version 1.61,
but again the implementation differs from DOM specification in the following:
if a declaration of the same namespace prefix already exists
on the element, then changing its value via this method
automatically changes the namespace of all elements and attributes
in its scope. This is because in libxml2 the namespace URI of an element
is not static but is computed from a pointer to a namespace declaration attribute.
getAttribute
$avalue = $node->getAttribute( $aname );
If $node has an attribute with the name $aname, the value of this attribute will get
returned.
getAttributeNS
$avalue = $node->getAttributeNS( $nsURI, $aname );
Retrieves an attribute value by local name and namespace URI.
getAttributeNode
$attrnode = $node->getAttributeNode( $aname );
Retrieve an attribute node by name. If no attribute with a given name exists, undef is returned.
getAttributeNodeNS
$attrnode = $node->getAttributeNodeNS( $namespaceURI, $aname );
Retrieves an attribute node by local name and namespace URI. If no attribute with a given localname and namespace exists, undef is returned.
removeAttribute
$node->removeAttribute( $aname );
The method removes the attribute $aname from the node's attribute list, if the attribute can be found.
removeAttributeNS
$node->removeAttributeNS( $nsURI, $aname );
Namespace version of removeAttribute.
hasAttribute
$boolean = $node->hasAttribute( $aname );
This function tests if the named attribute is set for the node. If the attribute is specified, TRUE (1) will be returned, otherwise the
return value is FALSE (0).
hasAttributeNS
$boolean = $node->hasAttributeNS( $nsURI, $aname );
Namespace version of hasAttribute.
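The non-namespace attribute methods can be exercised together on a standalone element, as in this sketch with invented attribute names.

```perl
use strict;
use warnings;
use XML::LibXML;

my $elem = XML::LibXML::Element->new('item');

# Set two attributes, then inspect and remove one of them.
$elem->setAttribute( id   => 'a1' );
$elem->setAttribute( lang => 'en' );

my $has_id = $elem->hasAttribute('id');
my $id     = $elem->getAttribute('id');

$elem->removeAttribute('lang');
my $lang = $elem->getAttribute('lang');    # undef after removal

my $result = $elem->toString;
print "$result\n";
```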
getChildrenByTagName
@nodes = $node->getChildrenByTagName($tagname);
The function gives direct access to all child elements of the current node with a given tagname, where
tagname is a qualified name, that is, in case of namespace usage it may consist of a prefix and local
name. This function makes things a lot easier if one needs
to handle big data sets. A special tagname '*' can be used to match any name.
If this function is called in SCALAR context, it returns the number of elements found.
getChildrenByTagNameNS
@nodes = $node->getChildrenByTagNameNS($nsURI,$tagname);
Namespace version of getChildrenByTagName. A special nsURI '*' matches any namespace URI,
in which case the function behaves just like getChildrenByLocalName.
If this function is called in SCALAR context, it returns the number of elements found.
getChildrenByLocalName
@nodes = $node->getChildrenByLocalName($localname);
The function gives direct access to all child elements of the current node with a given local name. It makes things a lot easier if one needs
to handle big data sets. A special localname '*' can be used to match any local name.
If this function is called in SCALAR context, it returns the number of elements found.
getElementsByTagName
@nodes = $node->getElementsByTagName($tagname);
This function is part of the DOM specification. It
fetches all descendants of a node with a given tagname,
where tagname is a qualified name,
that is, in case of namespace usage it may consist of a prefix and
local name.
A special tagname '*' can be used to match any tag name.
In SCALAR context this function returns an XML::LibXML::NodeList object.
getElementsByTagNameNS
@nodes = $node->getElementsByTagNameNS($nsURI,$localname);
Namespace version of getElementsByTagName as found in the DOM spec.
A special localname '*' can be used to match any local name
and nsURI '*' can be used to match any namespace URI.
In SCALAR context this function returns an XML::LibXML::NodeList object.
getElementsByLocalName
@nodes = $node->getElementsByLocalName($localname);
This function is not found in the DOM specification. It is a mix of getElementsByTagName and getElementsByTagNameNS. It will fetch all
tags matching the given local-name. This allows one to select tags with the same local name across namespace borders.
In SCALAR context this function returns an XML::LibXML::NodeList object.
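The difference between the getChildrenBy* and getElementsBy* families is easiest to see side by side. This is a small sketch with an invented document; it only uses list context for getChildrenByTagName and shows the NodeList returned by getElementsByTagName in scalar context.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document: one <item> is nested inside <group>.
my $doc = XML::LibXML->load_xml(string => <<'XML');
<list>
  <item>a</item>
  <group><item>b</item></group>
  <item>c</item>
</list>
XML
my $root = $doc->documentElement;

my @children    = $root->getChildrenByTagName('item');    # direct children only
my @descendants = $root->getElementsByTagName('item');    # whole subtree
my $nodelist    = $root->getElementsByTagName('item');    # scalar context: NodeList

printf "%d children, %d descendants\n",
    scalar(@children), scalar(@descendants);
printf "NodeList size: %d\n", $nodelist->size;
```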
appendWellBalancedChunk
$node->appendWellBalancedChunk( $chunk );
Sometimes it is necessary to append an XML tree, given as a string, to a node. appendWellBalancedChunk does this,
but only if the string is well-balanced.
Note that appendWellBalancedChunk() is only left for compatibility reasons. Implicitly it uses
my $fragment = $parser->parse_balanced_chunk( $chunk );
$node->appendChild( $fragment );
This form is more explicit and makes it easier to control the flow of a script.
appendText
$node->appendText( $PCDATA );
Alias for appendTextNode().
appendTextNode
$node->appendTextNode( $PCDATA );
This wrapper function lets you add a string directly to an element node.
appendTextChild
$node->appendTextChild( $childname , $PCDATA );
Similar to appendTextNode, but instead of adding text to $node itself it appends a new child element,
named $childname, that contains only a text node with the given content.
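A short sketch of the three text helpers follows. The element names are arbitrary and chosen only for this illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $doc->createElement('person');    # element names are made up
$doc->setDocumentElement($root);

# appendTextChild: creates <name>Ada</name> as a new child element.
$root->appendTextChild('name', 'Ada');

# appendTextNode / appendText: add text to the element itself.
my $note = $doc->createElement('note');
$root->appendChild($note);
$note->appendTextNode('first ');
$note->appendText('second');

print $root->toString, "\n";
```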
setNamespace
$node->setNamespace( $nsURI , $nsPrefix, $activate );
setNamespace() applies a namespace to an element.
The function takes three parameters: the namespace URI,
which is required, and two optional values: the prefix,
which is the namespace prefix as it should be used in
child elements or attributes, and the activate flag.
If the prefix is not given, undefined or empty, this
function tries to create a declaration of the default
namespace.
The activate parameter is most useful: If
this parameter is set to FALSE (0), a new namespace
declaration is simply added to the element
while the element's namespace itself is not
altered. By default, however, activate is set to TRUE (1).
In this case the namespace
is used as the node's effective
namespace. This means the namespace prefix is
added to the node name and if there was a
namespace already active for the node, it will
be replaced (but its declaration is not removed from the document).
A new namespace declaration is only created if necessary
(that is, if the element is already in the scope
of a namespace declaration associating the prefix
with the namespace URI, then this declaration is reused).
The following example may clarify this:
my $e1 = $doc->createElement("bar");
$e1->setNamespace("http://foobar.org", "foo");
results in
<foo:bar xmlns:foo="http://foobar.org"/>
while
my $e2 = $doc->createElement("bar");
$e2->setNamespace("http://foobar.org", "foo", 0);
results only in
<bar xmlns:foo="http://foobar.org"/>
By using $activate == 0 it is possible to
create multiple namespace declarations on a single
element.
The function fails if it is required to
create a declaration associating the prefix
with the namespace URI but the element already
carries a declaration with the same prefix but
different namespace URI.
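The following sketch illustrates declaring several namespaces on one element with $activate == 0 and then activating one of them, which reuses the existing declaration. The URIs are invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML::Document->new('1.0', 'UTF-8');
my $e   = $doc->createElement('bar');

# Two declarations, neither of which changes the element's own namespace.
$e->setNamespace('http://example.org/a', 'a', 0);
$e->setNamespace('http://example.org/b', 'b', 0);
my $name_before = $e->nodeName;

# Activating an already declared namespace reuses that declaration.
$e->setNamespace('http://example.org/b', 'b');
my $name_after = $e->nodeName;

print "$name_before -> $name_after\n";
print $e->toString, "\n";
```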
setNamespaceDeclURI
$node->setNamespaceDeclURI( $nsPrefix, $newURI );
EXPERIMENTAL IN 1.61 !
This function directly manipulates
an existing namespace
declaration on an element. It takes
two parameters: the prefix by which it
looks up the namespace declaration and
a new namespace URI which replaces its previous
value.
It returns 1 if the namespace declaration
was found and changed, 0 otherwise.
All elements and attributes (even those previously
unbound from the document) for which the
namespace declaration determines their namespace
belong to the new namespace after
the change.
If the new URI is undef or empty, the nodes
have no namespace and no prefix after the change.
Namespace declarations
once nulled in this way do not
further appear in the serialized output
(but do remain in the document for internal integrity
of libxml2 data structures).
This function is NOT part of any DOM API.
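As a rough illustration of setNamespaceDeclURI, the sketch below changes the URI of a declaration and checks that a child element in its scope follows along. The document and URIs are hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document with a single prefixed namespace declaration.
my $doc = XML::LibXML->load_xml(
    string => '<x:root xmlns:x="http://example.org/old"><x:child/></x:root>'
);
my $root  = $doc->documentElement;
my $child = $root->firstChild;

# Returns a true value when the declaration for prefix 'x' was found and changed.
my $changed = $root->setNamespaceDeclURI('x', 'http://example.org/new');

# A prefix that is not declared on the element yields a false value.
my $not_found = $root->setNamespaceDeclURI('nope', 'http://example.org/z');

print "child is now in ", $child->namespaceURI, "\n" if $changed;
```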
setNamespaceDeclPrefix
$node->setNamespaceDeclPrefix( $oldPrefix, $newPrefix );
EXPERIMENTAL IN 1.61 !
This function directly manipulates
an existing namespace
declaration on an element. It takes
two parameters: the old prefix by which it
looks up the namespace declaration and
a new prefix which is to replace the old one.
The function dies with an error
if the element is in the scope of
another declaration whose prefix equals
the new prefix, or if the change would
result in a declaration with a non-empty prefix but
empty namespace URI.
Otherwise, it returns 1 if the namespace declaration
was found and changed and 0 if not found.
All elements and attributes (even those previously
unbound from the document) for which the
namespace declaration determines their namespace
change their prefix to the new value.
If the new prefix is undef or empty,
the namespace declaration becomes
a declaration of a default namespace.
The corresponding nodes drop their namespace prefix
(but remain in the, now default, namespace).
In this case the function fails, if the containing element
is in the scope of another default namespace declaration.
This function is NOT part of any DOM API.
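A minimal sketch of renaming a prefix with setNamespaceDeclPrefix, again using an invented document: the namespace URI stays the same while the prefix of every node in scope changes.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document; prefix 'x' will be renamed to 'y'.
my $doc = XML::LibXML->load_xml(
    string => '<x:root xmlns:x="http://example.org/ns"><x:child/></x:root>'
);
my $root  = $doc->documentElement;
my $child = $root->firstChild;

my $renamed = $root->setNamespaceDeclPrefix('x', 'y');

print $root->nodeName, " / ", $child->nodeName, "\n" if $renamed;
print "URI unchanged: ", $child->namespaceURI, "\n";
```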
Overloading
XML::LibXML::Element overloads hash dereferencing to
provide access to the element's attributes. For non-namespaced
attributes, the attribute name is the hash key, and the attribute
value is the hash value. For namespaced attributes, the hash key
is qualified with the namespace URI, using Clark notation.
Perl's "tied hash" feature is used, which means that the
hash gives you read-write access to the element's attributes.
For more information, see XML::LibXML::AttributeHash.
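The hash-dereferencing interface might be used as in this sketch. The element and the namespace URI are made up; the key for the namespaced attribute is written in Clark notation, {URI}localname.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical element with one plain and one namespaced attribute.
my $doc = XML::LibXML->load_xml(
    string => '<a href="http://example.org/" xmlns:x="http://example.org/x" x:rel="next"/>'
);
my $elem = $doc->documentElement;

print "href: $elem->{href}\n";                               # read
$elem->{title} = 'Example';                                  # write
my $rel = $elem->{'{http://example.org/x}rel'};              # Clark notation
delete $elem->{href};                                        # remove

print "rel: $rel\n";
print $elem->toString, "\n";
```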
XML::LibXML Class for Text Nodes
XML::LibXML::Text
Synopsis
use XML::LibXML;
# Only methods specific to Text nodes are listed here,
# see the XML::LibXML::Node manpage for other methods
Description
Unlike the DOM specification, XML::LibXML implements the text node as the base class of all character data nodes. Therefore no
CharacterData class exists. This allows one to apply the methods of text nodes to comments and CDATA sections as well.
Methods
The class inherits from XML::LibXML::Node.
The documentation for Inherited methods is not listed here.
Many functions listed here are
extensively documented in the DOM Level 3 specification. Please refer to
the specification for extensive documentation.
new
$text = XML::LibXML::Text->new( $content );
The constructor of the class. It creates an unbound text node.
data
$nodedata = $text->data;
Although the Node class already has the nodeValue attribute, the DOM specification defines data as a separate
attribute. XML::LibXML implements these two not as different attributes but as aliases, as
libxml2 does. Therefore
$text->data;
and
$text->nodeValue;
will have the same result and are not different entities.
setData($string)
$text->setData( $text_content );
This function sets or replaces the text content of a node. The node has to be of the type "text", "cdata" or
"comment".
substringData($offset,$length)
$text->substringData($offset, $length);
Extracts a range of data from the node. (DOM Spec) This function takes the two parameters $offset and $length and returns the
sub-string, if available.
If the node contains no data or $offset refers to a non-existing string index, this function will return undef.
If $length is out of range substringData will return the data starting at $offset instead of causing an error.
appendData($string)
$text->appendData( $somedata );
Appends a string to the end of the existing data. If the current text node contains no data, this function has the same effect as
setData.
insertData($offset,$string)
$text->insertData($offset, $string);
Inserts the parameter $string at the given $offset of the existing data of the node. This operation does not remove existing data;
the data at and after $offset is moved behind the inserted string.
The $offset must not be negative. If $offset is out of range, insertData behaves like
appendData.
deleteData($offset, $length)
$text->deleteData($offset, $length);
This method removes a chunk from the existing node data at the given offset. The $length parameter specifies how many characters
are removed from the string.
deleteDataString($string, [$all])
$text->deleteDataString($remstring, $all);
This method removes a chunk from the existing node data. Since the DOM spec is quite unhandy if you already know which
string to remove from a text node, this method allows more perlish code :)
The function takes two parameters: $string and, optionally, the $all flag. If $all is not set,
undef or 0, deleteDataString will remove only the first occurrence of
$string. If $all is TRUE, deleteDataString will remove all occurrences of $string
from the node data.
replaceData($offset, $length, $string)
$text->replaceData($offset, $length, $string);
The DOM style version to replace node data.
replaceDataString($oldstring, $newstring, [$all])
$text->replaceDataString($old, $new, $flag);
The more programmer friendly version of replaceData() :)
Instead of giving offsets and length one can specify
the exact string ($oldstring) to
be replaced. Additionally the $all
flag allows one to replace all occurrences of
$oldstring.
replaceDataRegEx( $search_cond, $replace_cond, $reflags )
$text->replaceDataRegEx( $search_cond, $replace_cond, $reflags );
This method replaces the node's data by a
simple regular expression.
Optionally, this function accepts flags
that are appended to the substitution
statement.
NOTE: This is a shortcut for
my $datastr = $node->getData();
$datastr =~ s/somecond/replacement/g; # 'g' is just an example for any flag
$node->setData( $datastr );
This function can make things easier to read for simple replacements. For more complex variants it is recommended to use the code
snippet above.
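The data manipulation methods of this class can be chained as in the following sketch. It works on an unbound text node with invented content; the comments show the data expected after each step.

```perl
use strict;
use warnings;
use XML::LibXML;

my $text = XML::LibXML::Text->new('Hello World');

my $sub = $text->substringData(0, 5);           # 'Hello' (data unchanged)

$text->appendData('!');                         # 'Hello World!'
$text->insertData(5, ',');                      # 'Hello, World!'
$text->replaceDataString('World', 'Perl');      # 'Hello, Perl!'
$text->deleteDataString('!');                   # 'Hello, Perl'
$text->replaceDataRegEx('l+', 'L', 'g');        # 'HeLo, PerL'
$text->deleteData(0, 6);                        # 'PerL'

print "substring: $sub\n";
print "data:      ", $text->data, "\n";
```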
XML::LibXML Class for CDATA Sections
XML::LibXML::CDATASection
Synopsis
use XML::LibXML;
# Only methods specific to CDATA nodes are listed here,
# see the XML::LibXML::Node manpage for other methods
Description
This class provides all functions of XML::LibXML::Text, but for CDATA nodes.
Methods
The class inherits from XML::LibXML::Node.
The documentation for Inherited methods is not listed here.
Many functions listed here are
extensively documented in the DOM Level 3 specification. Please refer to
the specification for extensive documentation.
new
$node = XML::LibXML::CDATASection->new( $content );
The constructor is the only provided function for this package. It is required, because libxml2 treats the
different text node types slightly differently.
XML::LibXML Attribute Class
XML::LibXML::Attr
Synopsis
use XML::LibXML;
# Only methods specific to Attribute nodes are listed here,
# see the XML::LibXML::Node manpage for other methods
Description
This is the interface to handle Attributes like ordinary nodes. The naming of the class relies on the W3C DOM documentation.
Methods
The class inherits from XML::LibXML::Node.
The documentation for Inherited methods is not listed here.
Many functions listed here are
extensively documented in the DOM Level 3 specification. Please refer to
the specification for extensive documentation.
new
$attr = XML::LibXML::Attr->new($name [,$value]);
Class constructor. If you need to work with ISO encoded strings, you should always use the createAttribute
method of XML::LibXML::Document.
getValue
$string = $attr->getValue();
Returns the value stored for the attribute. If undef is returned, the attribute has no value, which is different from
not being specified.
value
$string = $attr->value;
Alias for getValue()
setValue
$attr->setValue( $string );
This is needed to set a new attribute value. If ISO encoded strings are passed as parameter, the node has to be bound to a document,
otherwise the encoding might be done incorrectly.
getOwnerElement
$node = $attr->getOwnerElement();
Returns the node the attribute belongs to. If the attribute is not bound to a node, undef will be returned. Overriding the underlying
implementation, the parentNode function returns undef instead of the owner element.
setNamespace
$attr->setNamespace($nsURI, $prefix);
This function tries to bind the attribute to a given namespace.
If $nsURI is undefined or empty,
the function discards any previous association of the attribute with a namespace.
If the namespace was not previously declared in the context of the
attribute, this function will fail.
In this case you may wish to call setNamespace() on the ownerElement.
If the namespace URI is non-empty and
declared in the context of the attribute, but only with a different
(non-empty) prefix, then the attribute is still bound to the namespace
but gets a different prefix than $prefix.
The function also fails if the prefix is empty but the namespace URI
is not (because unprefixed attributes should by definition belong to
no namespace).
This function returns 1 on success, 0 otherwise.
isId
$bool = $attr->isId;
Determine whether an attribute is of type
ID. For documents with a DTD, this information
is only available if DTD loading/validation has been requested.
For HTML documents parsed with the HTML
parser ID detection is done
automatically. In XML documents, all "xml:id"
attributes are considered to be of type ID.
serializeContent($docencoding)
$string = $attr->serializeContent;
This function is not part of DOM API. It returns attribute content
in the form in which it serializes into XML, that is
with all meta-characters properly quoted and with raw
entity references (except for entities expanded during parse time).
Setting the optional $docencoding flag to 1 enforces document
encoding for the output string (which is then passed to Perl as a
byte string). Otherwise the string is passed to Perl as (UTF-8 encoded)
characters.
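A brief sketch of working with attribute nodes follows, using an invented element. It contrasts getValue with serializeContent and checks isId on an xml:id attribute, which is looked up through the XML namespace URI.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical element; the title value contains an escaped ampersand.
my $doc = XML::LibXML->load_xml(
    string => '<p xml:id="intro" title="a &amp; b"/>'
);
my $p = $doc->documentElement;

my $title = $p->getAttributeNode('title');
my $id    = $p->getAttributeNodeNS('http://www.w3.org/XML/1998/namespace', 'id');

print "value:      ", $title->getValue, "\n";            # unescaped text
print "serialized: ", $title->serializeContent, "\n";    # as written in XML
print "owner:      ", $title->getOwnerElement->nodeName, "\n";
print "xml:id is an ID attribute\n" if $id->isId;
```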
XML::LibXML's DOM L2 Document Fragment Implementation
XML::LibXML::DocumentFragment
Synopsis
use XML::LibXML;
Description
This class is a helper class as described in the DOM Level 2 Specification. It is implemented as a node without a name. All adding, inserting or
replacing functions are aware of document fragments.
In addition, all unbound nodes (all nodes that do not belong to any document sub-tree) are implicit members of document fragments.
XML::LibXML Namespace Implementation
XML::LibXML::Namespace
Synopsis
use XML::LibXML;
# Only methods specific to Namespace nodes are listed here,
# see the XML::LibXML::Node manpage for other methods
Description
Namespace nodes are returned either by $element->findnodes('namespace::foo') or by $node->getNamespaces().
The namespace node API is not part of any current DOM API, and so it is quite minimal. It should be noted that namespace nodes are
not a subclass of XML::LibXML::Node; however, namespace nodes act a lot like attribute nodes, and similarly named methods will
return what you would expect if you treated the namespace node as an attribute. Note that in order to fix several inconsistencies between the API and the documentation, the behavior of some functions was changed in 1.64.
Methods
new
my $ns = XML::LibXML::Namespace->new($nsURI);
Creates a new Namespace node. Note that this is not a 'node' in the sense of an attribute or an element node. Therefore you cannot
call all node functions on it. All functions available for this node are listed below.
Optionally you can pass the prefix to the namespace constructor. If this second parameter is omitted you will create a so-called
default namespace. Note that the newly created namespace is not bound to any document or node, therefore you should not expect it to be
available in an existing document.
declaredURI
Returns the URI for this namespace.
declaredPrefix
Returns the prefix for this namespace.
nodeName
print $ns->nodeName();
Returns "xmlns:prefix", where prefix is the prefix for this namespace.
name
print $ns->name();
Alias for nodeName()
getLocalName
$localname = $ns->getLocalName();
Returns the local name of this node as if it were an attribute, that is, the prefix associated with the namespace.
getData
print $ns->getData();
Returns the URI of the namespace, i.e. the value of this node as if it were an attribute.
getValue
print $ns->getValue();
Alias for getData()
value
print $ns->value();
Alias for getData()
getNamespaceURI
$known_uri = $ns->getNamespaceURI();
Returns the string "http://www.w3.org/2000/xmlns/"
getPrefix
$known_prefix = $ns->getPrefix();
Returns the string "xmlns"
unique_key
$key = $ns->unique_key();
This method returns a key guaranteed to be unique for this namespace, and to always be the same value for this namespace. Two namespace objects return the same key if and only if they have the same prefix and the same URI. The returned key value is useful as a key in hashes.
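The accessors listed above can be compared on a single namespace node, as in this sketch. The document and its namespace URI are invented; the second namespace object is created unbound only to compare unique_key values.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document with one namespace declaration on the root element.
my $doc  = XML::LibXML->load_xml(string => '<root xmlns:x="http://example.org/x"/>');
my ($ns) = $doc->documentElement->getNamespaces;

print "declaredPrefix:  ", $ns->declaredPrefix,  "\n";
print "declaredURI:     ", $ns->declaredURI,     "\n";
print "nodeName:        ", $ns->nodeName,        "\n";
print "getLocalName:    ", $ns->getLocalName,    "\n";
print "getData:         ", $ns->getData,         "\n";
print "getNamespaceURI: ", $ns->getNamespaceURI, "\n";
print "getPrefix:       ", $ns->getPrefix,       "\n";

# An unbound namespace object with the same prefix and URI.
my $copy = XML::LibXML::Namespace->new('http://example.org/x', 'x');
print "same key\n" if $ns->unique_key eq $copy->unique_key;
```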
XML::LibXML Processing Instructions
XML::LibXML::PI
Synopsis
use XML::LibXML;
# Only methods specific to Processing Instruction nodes are listed here,
# see the XML::LibXML::Node manpage for other methods
Description
Processing instructions are implemented in XML::LibXML with read and write access. The PI data is the PI without the PI target (as specified in
XML 1.0 [17]) as a string. This string can be accessed with getData as implemented in XML::LibXML::Node.
The write access is aware of the fact that many processing instructions have attribute-like data. Therefore, besides the interface that conforms to the DOM
specification, setData() accepts a set of named parameters. So the code segment
my $pi = $dom->createProcessingInstruction("abc");
$pi->setData(foo=>'bar', foobar=>'foobar');
$dom->appendChild( $pi );
will result in the following PI in the DOM:
<?abc foo="bar" foobar="foobar"?>
Which is how it is specified in the DOM specification. This three-step interface temporarily creates a node in Perl space. This can be avoided by
using the insertProcessingInstruction() method. Instead of the three calls described above, the call
$dom->insertProcessingInstruction("abc",'foo="bar" foobar="foobar"');
will have the same result as above.
XML::LibXML::PI's implementation of setData() documented below differs a bit from the standard version as available in XML::LibXML::Node:
setData
$pinode->setData( $data_string );
$pinode->setData( name=>string_value [...] );
This method allows one to change the content data of
a PI. In addition to the interface specified for DOM
Level 2, the method provides a named parameter
interface to set the data. This parameter list is
converted into a string before it is appended to the
PI.
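Both calling conventions of setData() are shown in the sketch below, with an invented PI target and values. Because the named parameters pass through a Perl hash, the sketch does not rely on the order in which the pairs end up in the PI data.

```perl
use strict;
use warnings;
use XML::LibXML;

my $dom  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $dom->createElement('doc');       # names are made up for this sketch
$dom->setDocumentElement($root);

my $pi = $dom->createProcessingInstruction('my-pi');
$root->appendChild($pi);

# Named-parameter form: each pair becomes name="value" in the PI data.
$pi->setData(type => 'text/xsl', href => 'style.xsl');
my $named = $pi->getData;

# Plain string form: the data is stored as given.
$pi->setData('version="2"');
my $plain = $pi->getData;

print "named: $named\n";
print "plain: $plain\n";
```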
XML::LibXML DTD Handling
XML::LibXML::Dtd
Synopsis
use XML::LibXML;
Description
This class holds a DTD. You may parse a DTD from either a string, or from an external SYSTEM identifier.
No support is available as yet for parsing from a filehandle.
XML::LibXML::Dtd is a sub-class of XML::LibXML::Node, so all the methods available to nodes (particularly toString()) are available to Dtd objects.
Methods
new
$dtd = XML::LibXML::Dtd->new($public_id, $system_id);
Parse a DTD from the system identifier, and return a DTD object that you can pass to $doc->is_valid() or $doc->validate().
my $dtd = XML::LibXML::Dtd->new(
"SOME // Public / ID / 1.0",
"test.dtd"
);
my $doc = XML::LibXML->new->parse_file("test.xml");
$doc->validate($dtd);
parse_string
$dtd = XML::LibXML::Dtd->parse_string($dtd_str);
The same as new() above, except you can parse a DTD from a string. Note that parsing from a string may fail if the DTD contains external parameter-entity references with relative URLs.
getName
$name = $dtd->getName();
Returns the name of DTD; i.e., the name immediately following the DOCTYPE keyword.
publicId
$publicId = $dtd->publicId();
Returns the public identifier of the external subset.
systemId
$systemId = $dtd->systemId();
Returns the system identifier of the external subset.
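A small sketch of validating against a DTD parsed from a string follows. The DTD and both documents are invented; is_valid() returns a boolean, while validate() is wrapped in eval because it dies on invalid input.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical DTD: a <note> must contain <to> followed by <body>.
my $dtd = XML::LibXML::Dtd->parse_string(<<'DTD');
<!ELEMENT note (to, body)>
<!ELEMENT to   (#PCDATA)>
<!ELEMENT body (#PCDATA)>
DTD

my $good = XML::LibXML->load_xml(string => '<note><to>Ann</to><body>Hi</body></note>');
my $bad  = XML::LibXML->load_xml(string => '<note><body>Hi</body></note>');

print "good: ", ($good->is_valid($dtd) ? "valid" : "invalid"), "\n";
print "bad:  ", ($bad->is_valid($dtd)  ? "valid" : "invalid"), "\n";

# validate() reports the problem by dying, so trap it.
my $error;
eval { $bad->validate($dtd); 1 } or $error = $@;
print "validate() failed: $error\n" if defined $error;
```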
XML::LibXML Class for Input Callbacks
XML::LibXML::InputCallback
Synopsis
use XML::LibXML;
my $input_callbacks = XML::LibXML::InputCallback->new();
$input_callbacks->register_callbacks([ $match_cb1, $open_cb1,
$read_cb1, $close_cb1 ] );
$input_callbacks->register_callbacks([ $match_cb2, $open_cb2,
$read_cb2, $close_cb2 ] );
$input_callbacks->register_callbacks( [ $match_cb3, $open_cb3,
$read_cb3, $close_cb3 ] );
$parser->input_callbacks( $input_callbacks );
$parser->parse_file( $some_xml_file );
Description
You may get unexpected results if you are trying to load external documents during libxml2 parsing if the location of the resource is not an
HTTP, FTP or relative location but, for example, an absolute path. To get around this limitation, you may add your own input handler to open, read and
close particular types of locations or URI classes. Using these input callback handlers, you can, for example, handle your own custom URI schemes.
The input callbacks are used whenever XML::LibXML has to get something other than externally parsed entities from somewhere. They are implemented
using a callback stack on the Perl layer in analogy to libxml2's native callback stack.
The XML::LibXML::InputCallback class transparently registers the input callbacks for libxml2's parser processes.
How does XML::LibXML::InputCallback work?
The libxml2 library offers a callback implementation as global functions only. To work around the troubles resulting from having only global
callbacks (for example, if the same global callback stack is manipulated by different applications running together in a single Apache
web-server environment), XML::LibXML::InputCallback comes with an object-oriented and a function-oriented part.
Using the function-oriented part, the global callback stack of libxml2 can be manipulated. Those functions can be used as an interface to the
callbacks on the C and XS layer. The object-oriented part implements operations for working with the "pseudo-localized" callback stack.
Currently, you can register and de-register callbacks on the Perl layer and initialize them on a per-parser basis.
Callback Groups
The libxml2 input callbacks come in groups. One group contains a URI matcher (match), a data stream constructor (open),
a data stream reader (read), and a data stream destructor (close). The callbacks can be
manipulated on a per group basis only.
The Parser Process
The parser process works on an XML data stream, in which links to other resources can be embedded. These can be links to external
DTDs or XIncludes, for example. Those resources are identified by URIs. The callback implementation of libxml2 assumes that one callback
group can handle a certain number of URIs and a certain URI scheme. By default, callback handlers for file://*,
file://*.gz, http://* and ftp://* are registered.
Callback groups in the callback stack are processed from top to bottom, meaning that callback groups registered later will be
processed before the earlier registered ones.
While parsing the data stream, the libxml2 parser checks whether a registered callback group will handle a URI; if none will, the URI
will be interpreted as file://URI. To handle a URI, the match callback will have to return
'1'. If that happens, the handling of the URI will be passed to that callback group. Next, the URI will be passed to the
open callback, which should return a reference to the data stream if it successfully opened the
file, '0' otherwise. If opening the stream was successful, the read callback will be called repeatedly until it
returns an empty string. After the read callback, the close callback will be called to close the stream.
Organisation of callback groups in XML::LibXML::InputCallback
Callback groups are implemented as a stack (Array),
each entry holds a reference to an array of the
callbacks. For the libxml2 library, the
XML::LibXML::InputCallback callback implementation
appears as one single callback group. The Perl
implementation however allows one to manage different
callback stacks on a per libxml2-parser basis.
Using XML::LibXML::InputCallback
After object instantiation using the parameter-less constructor, you can register callback groups.
my $input_callbacks = XML::LibXML::InputCallback->new();
$input_callbacks->register_callbacks([ $match_cb1, $open_cb1,
$read_cb1, $close_cb1 ] );
$input_callbacks->register_callbacks([ $match_cb2, $open_cb2,
$read_cb2, $close_cb2 ] );
$input_callbacks->register_callbacks( [ $match_cb3, $open_cb3,
$read_cb3, $close_cb3 ] );
$parser->input_callbacks( $input_callbacks );
$parser->parse_file( $some_xml_file );
What about the old callback system prior to XML::LibXML::InputCallback?
In XML::LibXML versions prior to 1.59 - i.e. without the XML::LibXML::InputCallback module - you could define your callbacks either
globally or locally. You can still do that using XML::LibXML::InputCallback, and in addition you can define the callbacks on a per-parser
basis!
If you use the old callback interface through global callbacks, XML::LibXML::InputCallback will treat them with a lower priority than the
ones registered using the new interface. The global callbacks will not override the callback groups registered using the new interface. Local
callbacks are attached to a specific parser instance, therefore they are treated with highest priority. If the match
callback of the callback group registered as local variable is identical to one of the callback groups registered using the new interface, that
callback group will be replaced.
Users of the old callback implementation whose open callback returned a plain string will have to adapt their code
to return a reference to that string after upgrading to version >= 1.59. The new callback system can only deal with the
open callback returning a reference!
Interface Description
Global Variables
$_CUR_CB
Stores the current callback and can be used as shortcut to access the callback stack.
@_GLOBAL_CALLBACKS
Stores all callback groups for the current parser process.
@_CB_STACK
Stores the currently used callback group. Used to prevent parser errors when dealing with nested XML data.
Global Callbacks
_callback_match
Implements the interface for the match callback at C-level and for the selection of the callback group
from the callbacks defined at the Perl-level.
_callback_open
Forwards the open callback from libxml2 to the corresponding callback function at the Perl-level.
_callback_read
Forwards the read request to the corresponding callback function at the Perl-level and returns the result to libxml2.
_callback_close
Forwards the close callback from libxml2 to the corresponding callback function at the Perl-level.
Class methods
new()
A simple constructor.
register_callbacks( [ $match_cb, $open_cb, $read_cb, $close_cb ])
The four callbacks have to be given as an array reference in the above order: match,
open, read, close!
unregister_callbacks( [ $match_cb, $open_cb, $read_cb, $close_cb ])
With no arguments given, unregister_callbacks() will delete the last registered callback group from the
stack. If four callbacks are passed as array reference, the callback group to unregister will be identified by the
match callback and deleted from the callback stack. Note that if several identical match
callbacks are defined in different callback groups, ALL of them will be deleted from the stack.
init_callbacks( $parser )
Initializes the callback system for the provided parser before starting a parsing process.
cleanup_callbacks()
Resets global variables and the libxml2 callback stack.
lib_init_callbacks()
Used internally for callback registration at C-level.
lib_cleanup_callbacks()
Used internally for callback resetting at the C-level.
Example callbacks
The following example is a purely fictitious example that uses a MyScheme::Handler object that responds to methods similar to an IO::Handle.
# Define the four callback functions
sub match_uri {
my $uri = shift;
return $uri =~ /^myscheme:/; # trigger our callback group for 'myscheme' URIs
}
sub open_uri {
my $uri = shift;
my $handler = MyScheme::Handler->new($uri);
return $handler;
}
# The returned $buffer will be parsed by the libxml2 parser
sub read_uri {
my $handler = shift;
my $length = shift;
my $buffer;
read($handler, $buffer, $length);
return $buffer; # $buffer will be an empty string '' if read() is done
}
# Close the handle associated with the resource.
sub close_uri {
my $handler = shift;
close($handler);
}
# Register them with an instance of XML::LibXML::InputCallback
my $input_callbacks = XML::LibXML::InputCallback->new();
$input_callbacks->register_callbacks([ \&match_uri, \&open_uri,
\&read_uri, \&close_uri ] );
# Register the callback group at a parser instance
$parser->input_callbacks( $input_callbacks );
# $some_xml_file will be parsed using our callbacks
$parser->parse_file( $some_xml_file );
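A self-contained variant of the fictitious example above might look like the following sketch. Instead of a MyScheme::Handler object it serves resources from an in-memory hash: the %resources table, the myscheme: URI and the element names are all assumptions made for illustration. The open callback returns a reference to a copy of the string, and the read callback consumes that copy until it is empty.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical in-memory resources, keyed by the part after "myscheme:".
my %resources = ( greeting => '<greeting>hello</greeting>' );

my $match_cb = sub {
    my $uri = shift;
    return $uri =~ /^myscheme:/ ? 1 : 0;
};
my $open_cb = sub {
    my $uri = shift;
    my ($key) = $uri =~ /^myscheme:(.*)$/;
    return 0 unless defined $key && exists $resources{$key};
    my $copy = $resources{$key};
    return \$copy;                         # a reference, never a plain string
};
my $read_cb = sub {
    my ($ref, $length) = @_;
    return substr($$ref, 0, $length, '');  # '' once everything has been read
};
my $close_cb = sub { return 1 };

my $input_callbacks = XML::LibXML::InputCallback->new();
$input_callbacks->register_callbacks([ $match_cb, $open_cb, $read_cb, $close_cb ]);

my $parser = XML::LibXML->new( expand_xinclude => 1 );
$parser->input_callbacks($input_callbacks);

# The XInclude below is resolved through the callback group.
my $doc = $parser->parse_string(<<'XML');
<root xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="myscheme:greeting"/>
</root>
XML

my $greeting = $doc->findvalue('//greeting');
print "included: $greeting\n";
```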
RelaxNG Schema Validation
XML::LibXML::RelaxNG
Synopsis
use XML::LibXML;
$doc = XML::LibXML->new->parse_file($url);
Description
The XML::LibXML::RelaxNG class is a tiny frontend to libxml2's RelaxNG implementation. Currently it supports only schema parsing and document
validation.
Methods
new
$rngschema = XML::LibXML::RelaxNG->new( location => $filename_or_url, no_network => 1 );
$rngschema = XML::LibXML::RelaxNG->new( string => $xmlschemastring, no_network => 1 );
$rngschema = XML::LibXML::RelaxNG->new( DOM => $doc, no_network => 1 );
The constructor of XML::LibXML::RelaxNG needs to be called with a list of parameters. At least one of the location, string or DOM parameters is required to
specify the source of the schema. Setting the optional parameter no_network to 1 prevents the parser from accessing the network, and setting the optional parameter recover
to 1 prevents the parser from calling die() on errors.
It is important that each schema has only a single source.
The location parameter allows one to parse a schema
from the filesystem or a (non-HTTPS) URL.
The string parameter will parse the schema from the given XML string.
The DOM parameter allows one to parse the schema from a pre-parsed XML::LibXML::Document.
Note that the constructor will die() if the schema does not meet the constraints of the RelaxNG specification.
validate
eval { $rngschema->validate( $doc ); };
This function allows one to validate a (parsed)
document against the given RelaxNG schema. The argument
of this function should be an XML::LibXML::Document
object. If this function succeeds, it will return 0,
otherwise it will die() and report the errors found.
Because of this, validate() should always be
wrapped in an eval block.
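The following is a minimal sketch of the schema-then-validate cycle described above. The schema and the two sample documents are made up for illustration; only the new() and validate() calls come from this section.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical schema: a <note> element that contains text only.
my $rng_string = <<'RNG';
<element name="note" xmlns="http://relaxng.org/ns/structure/1.0">
  <text/>
</element>
RNG

my $rngschema = XML::LibXML::RelaxNG->new( string => $rng_string );

my $parser = XML::LibXML->new;
my $good   = $parser->parse_string('<note>hello</note>');
my $bad    = $parser->parse_string('<memo>hello</memo>');

# validate() dies on failure, so each call is wrapped in eval.
my $good_ok = eval { $rngschema->validate($good); 1 } ? 1 : 0;
my $bad_ok  = eval { $rngschema->validate($bad);  1 } ? 1 : 0;
my $bad_err = $@;    # error raised by the second validation

print "good document: ", ( $good_ok ? "valid" : "invalid" ), "\n";
print "bad document:  ", ( $bad_ok  ? "valid" : "invalid: $bad_err" ), "\n";
```

Appending "; 1" inside the eval block gives a reliable success flag, so the caller does not have to depend on the return value of validate() itself.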
XML Schema Validation
XML::LibXML::Schema
Synopsis
use XML::LibXML;
$doc = XML::LibXML->new->parse_file($url);
Description
The XML::LibXML::Schema class is a tiny frontend to libxml2's XML Schema implementation. Currently it supports only schema parsing and
document validation. As of 2.6.32, libxml2 only supports decimal types up to 24 digits (the standard requires at least 18).
Methods
new
$xmlschema = XML::LibXML::Schema->new( location => $filename_or_url, no_network => 1 );
$xmlschema = XML::LibXML::Schema->new( string => $xmlschemastring, no_network => 1 );
The constructor of XML::LibXML::Schema must be called with a list of parameters. Either the location or the string parameter is required to
specify the source of the schema. The optional parameter no_network, when set to 1, prevents the parser from accessing the network; the optional parameter recover,
when set to 1, prevents the parser from calling die() on errors.
Each schema must have exactly one source.
The location parameter allows one to parse a schema
from the filesystem or a (non-HTTPS) URL.
The string parameter will parse the schema from the given XML string.
Note that the constructor will die() if the schema does not meet the constraints of the XML Schema specification.
validate
eval { $xmlschema->validate( $doc ); };
This function allows one to validate a (parsed)
document against the given XML Schema. The argument of
this function should be an XML::LibXML::Document object. If this
function succeeds, it will return 0, otherwise it will
die() and report the errors found. Because of this,
validate() should always be wrapped in an eval block.
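As a rough illustration of the constructor and validate() described above, the sketch below validates two small documents against an in-memory schema. The schema and the documents are assumptions chosen for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical schema: <count> must hold a non-negative integer.
my $xsd_string = <<'XSD';
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="count" type="xs:nonNegativeInteger"/>
</xs:schema>
XSD

my $xmlschema = XML::LibXML::Schema->new( string => $xsd_string );

my @results;
for my $xml ( '<count>42</count>', '<count>-1</count>' ) {
    my $doc = XML::LibXML->new->parse_string($xml);

    # validate() dies on invalid input; eval turns that into a flag.
    my $ok = eval { $xmlschema->validate($doc); 1 } ? 1 : 0;
    push @results, $ok;
    print "$xml: ", ( $ok ? "valid" : "invalid" ), "\n";
}
```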
XPath Evaluation
XML::LibXML::XPathContext
Description
The XML::LibXML::XPathContext
class provides an almost complete
interface to libxml2's XPath implementation.
With XML::LibXML::XPathContext, it is possible to
evaluate XPath expressions in the context
of an arbitrary node, context size, and context position,
with a user-defined namespace-prefix mapping,
custom XPath functions written in Perl, and
even a custom XPath variable resolver.
Examples
Namespaces
This example demonstrates the registerNs() method.
It finds all paragraph nodes in an XHTML document.
my $xc = XML::LibXML::XPathContext->new($xhtml_doc);
$xc->registerNs('xhtml', 'http://www.w3.org/1999/xhtml');
my @nodes = $xc->findnodes('//xhtml:p');
Custom XPath functions
This example demonstrates the registerFunction() method
by defining a function filtering nodes based on a Perl regular expression:
sub grep_nodes {
my ($nodelist,$regexp) = @_;
my $result = XML::LibXML::NodeList->new;
for my $node ($nodelist->get_nodelist()) {
$result->push($node) if $node->textContent =~ $regexp;
}
return $result;
};
my $xc = XML::LibXML::XPathContext->new($node);
$xc->registerFunction('grep_nodes', \&grep_nodes);
my @nodes = $xc->findnodes('//section[grep_nodes(para,"\bsearch(ing|es)?\b")]');
Variables
This example demonstrates the registerVarLookupFunc()
method. We use XPath variables to reuse the results of previous evaluations:
sub var_lookup {
my ($data,$varname,$ns)=@_;
return $data->{$varname};
}
my $areas = XML::LibXML->new->parse_file('areas.xml');
my $empl = XML::LibXML->new->parse_file('employees.xml');
my $xc = XML::LibXML::XPathContext->new($empl);
my %variables = (
A => $xc->find('/employees/employee[@salary>10000]'),
B => $areas->find('/areas/area[district="Brooklyn"]/street'),
);
# get names of employees from $A working in an area listed in $B
$xc->registerVarLookupFunc(\&var_lookup, \%variables);
my @nodes = $xc->findnodes('$A[work_area/street = $B]/name');
Methods
new
my $xpc = XML::LibXML::XPathContext->new();
Creates a new XML::LibXML::XPathContext object
without a context node.
my $xpc = XML::LibXML::XPathContext->new($node);
Creates a new XML::LibXML::XPathContext object with
the context node set to $node.
registerNs
$xpc->registerNs($prefix, $namespace_uri)
Registers namespace $prefix to
$namespace_uri.
unregisterNs
$xpc->unregisterNs($prefix)
Unregisters namespace $prefix.
lookupNs
$uri = $xpc->lookupNs($prefix)
Returns namespace URI registered with
$prefix. If $prefix
is not registered to any namespace URI returns
undef.
registerVarLookupFunc
$xpc->registerVarLookupFunc($callback, $data)
Registers variable lookup function
$callback. The registered function is
executed by the XPath engine each time an XPath variable
is evaluated. It takes three arguments:
$data, variable name, and variable
ns-URI and must return one value: a number or string or
any XML::LibXML:: object that can be a result
of findnodes: Boolean, Literal, Number, Node
(e.g. Document, Element, etc.), or NodeList. For
convenience, simple (non-blessed) array references
containing only XML::LibXML::Node objects can be
used instead of an XML::LibXML::NodeList.
getVarLookupData
$data = $xpc->getVarLookupData();
Returns the data that have been associated with a
variable lookup function during a previous call to
registerVarLookupFunc.
getVarLookupFunc
$callback = $xpc->getVarLookupFunc();
Returns the variable lookup function previously registered with
registerVarLookupFunc.
unregisterVarLookupFunc
$xpc->unregisterVarLookupFunc($name);
Unregisters variable lookup function and the associated lookup data.
registerFunctionNS
$xpc->registerFunctionNS($name, $uri, $callback)
Registers an extension function
$name in $uri
namespace. $callback must be a CODE
reference. The arguments of the callback function are
either simple scalars or XML::LibXML::* objects
depending on the XPath argument types. The function is
responsible for checking the argument number and
types. Result of the callback code must be a single
value of the following types: a simple scalar
(number, string) or an arbitrary XML::LibXML::*
object that can be a result of findnodes: Boolean,
Literal, Number, Node (e.g. Document, Element, etc.), or
NodeList. For convenience, simple (non-blessed) array
references containing only XML::LibXML::Node
objects can be used instead of an
XML::LibXML::NodeList.
unregisterFunctionNS
$xpc->unregisterFunctionNS($name, $uri)
Unregisters extension function $name
in $uri namespace. Has the same
effect as passing undef as
$callback to
registerFunctionNS.
registerFunction
$xpc->registerFunction($name, $callback)
Same as registerFunctionNS but
without a namespace.
unregisterFunction
$xpc->unregisterFunction($name)
Same as unregisterFunctionNS but
without a namespace.
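A small sketch of a namespaced extension function follows. The namespace URI urn:example:str, the prefix str, and the function name upper are invented for this example; the callback stringifies its argument instead of assuming a particular argument class.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new->parse_string(
    '<items><item>apple</item><item>Banana</item></items>');

my $xpc = XML::LibXML::XPathContext->new($doc);

# Hypothetical namespace and prefix, used only in this sketch.
$xpc->registerNs( str => 'urn:example:str' );
$xpc->registerFunctionNS( 'upper', 'urn:example:str', sub {
    # The callback is responsible for checking its own arguments.
    die "str:upper() expects exactly one argument\n" unless @_ == 1;
    my $text = "$_[0]";    # stringify whatever was passed in
    return uc $text;       # a simple scalar is a valid return value
} );

my $first = $xpc->findvalue('str:upper(string(/items/item[1]))');
my @match = $xpc->findnodes('/items/item[str:upper(string(.)) = "BANANA"]');

print "first item, upper-cased: $first\n";
print "items equal to BANANA ignoring case: ", scalar(@match), "\n";
```

Wrapping the argument in string() on the XPath side keeps the callback simple, because it then always receives a single string-like value rather than a node list.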
findnodes
@nodes = $xpc->findnodes($xpath)
@nodes = $xpc->findnodes($xpath, $context_node )
$nodelist = $xpc->findnodes($xpath, $context_node )
Evaluates the XPath expression on the current context node and
returns the result as a list. In scalar context,
returns an XML::LibXML::NodeList object. Optionally, a
node may be passed as a second argument to set the
context node for the query.
The XPath expression can be passed either as a string, or
as an XML::LibXML::XPathExpression object.
find
$object = $xpc->find($xpath )
$object = $xpc->find($xpath, $context_node )
Evaluates the XPath expression using the current node
as the context of the expression, and returns the result
depending on what type of result the XPath expression
had. For example, the XPath 1 * 3 +
52 results in an XML::LibXML::Number object
being returned. Other expressions might return an
XML::LibXML::Boolean object, or an
XML::LibXML::Literal object (a string). Each of those
objects uses Perl's overload feature to "do the right
thing" in different contexts. Optionally, a node may be
passed as a second argument to set the context node for
the query.
The XPath expression can be passed either as a string, or
as an XML::LibXML::XPathExpression object.
findvalue
$value = $xpc->findvalue($xpath )
$value = $xpc->findvalue($xpath, $context_node )
Is exactly equivalent to:
$xpc->find( $xpath, $context_node )->to_literal;
That is, it returns the literal value of the
results. This enables you to ensure that you get a string
back from your search, allowing certain shortcuts. This
could be used as the equivalent of <xsl:value-of
select="some_xpath"/>. Optionally, a node may be
passed in the second argument to set the context node for
the query.
The XPath expression can be passed either as a string, or
as an XML::LibXML::XPathExpression object.
exists
$bool = $xpc->exists( $xpath_expression, $context_node );
This method behaves like findnodes, except
that it only returns a boolean value (1 if the expression matches a node, 0 otherwise)
and may be faster than findnodes, because
the XPath evaluation may stop early on the first match (this is true for libxml2 >= 2.6.27).
For XPath expressions that do not return a node-set,
the method returns true if the returned value is a non-zero number or a non-empty string.
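The four query methods above differ mainly in what they return. The sketch below runs them side by side on a small made-up document; the element and attribute names are assumptions for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new->parse_string(<<'XML');
<library>
  <book year="1999"><title>First</title></book>
  <book year="2005"><title>Second</title></book>
</library>
XML

my $xpc = XML::LibXML::XPathContext->new($doc);

my @books   = $xpc->findnodes('/library/book');           # list of nodes
my $count   = $xpc->find('count(/library/book)');         # XML::LibXML::Number
my $title   = $xpc->findvalue('/library/book[1]/title');  # plain string
my $has_old = $xpc->exists('/library/book[@year < 2000]');
my $has_new = $xpc->exists('/library/book[@year > 2010]');

# A second argument sets the context node for a relative expression.
my $second_title = $xpc->findvalue( 'title', $books[1] );

printf "%d books, first title '%s', second title '%s'\n",
    scalar(@books), $title, $second_title;
print "book before 2000: ", ( $has_old ? "yes" : "no" ), "\n";
print "book after 2010:  ", ( $has_new ? "yes" : "no" ), "\n";
```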
setContextNode
$xpc->setContextNode($node)
Set the current context node.
getContextNode
my $node = $xpc->getContextNode;
Get the current context node.
setContextPosition
$xpc->setContextPosition($position)
Set the current context position. By default, this
value is -1 (and evaluating XPath function
position() in the initial context
raises an XPath error), but can be set to any value up
to context size. This is mainly useful for making the
XPath engine return a given position when the
position() XPath function is
called. Setting this value to -1 restores the default
behavior.
getContextPosition
my $position = $xpc->getContextPosition;
Get the current context position.
setContextSize
$xpc->setContextSize($size)
Set the current context size. By default, this value is -1 (and
evaluating XPath function last() in
the initial context raises an XPath error), but can be
set to any non-negative value. This is mainly useful
for making the XPath engine return the given value when the
last() XPath function is called. If
context size is set to 0, position is automatically also
set to 0. If context size is positive, position is
automatically set to 1. Setting context size to -1
restores the default behavior.
getContextSize
my $size = $xpc->getContextSize;
Get the current context size.
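The interaction between context size and context position described above can be sketched as follows. The values 5 and 3 are arbitrary, and the comments restate the behaviour documented in this section rather than adding new guarantees.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new->parse_string('<root/>');
my $xpc = XML::LibXML::XPathContext->new($doc);

# By default the context size is -1.
my $default_size = $xpc->getContextSize;

# Setting a positive size also sets the position to 1.
$xpc->setContextSize(5);
my $pos_after_size = $xpc->getContextPosition;

# The position may then be moved anywhere up to the context size.
$xpc->setContextPosition(3);
my $position = $xpc->findvalue('position()');
my $last     = $xpc->findvalue('last()');

print "default size: $default_size\n";
print "position after setContextSize(5): $pos_after_size\n";
print "position() = $position, last() = $last\n";
```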
Bugs And Caveats
XML::LibXML::XPathContext objects
are reentrant, meaning that you can call
methods of an XML::LibXML::XPathContext even from XPath
extension functions registered with the same object or from a
variable lookup function. On the other hand, you should rather
avoid registering new extension functions, namespaces and a
variable lookup function from within extension functions and a
variable lookup function, unless you want to experience
untested behavior.
Authors
Ilya Martynov and Petr Pajas, based on
XML::LibXML and XML::LibXSLT code by Matt Sergeant and
Christian Glahn.
Historical remark
Prior to XML::LibXML 1.61 this module was distributed separately
for maintenance reasons.
XML::LibXML::Reader - interface to libxml2 pull parser
XML::LibXML::Reader
Synopsis
use XML::LibXML::Reader;
my $reader = XML::LibXML::Reader->new(location => "file.xml")
or die "cannot read file.xml\n";
while ($reader->read) {
processNode($reader);
}
sub processNode {
my $reader = shift;
printf "%d %d %s %d\n", ($reader->depth,
$reader->nodeType,
$reader->name,
$reader->isEmptyElement);
}
or
my $reader = XML::LibXML::Reader->new(location => "file.xml")
or die "cannot read file.xml\n";
$reader->preservePattern('//table/tr');
$reader->finish;
print $reader->document->toString(1);
DESCRIPTION
This is a perl interface to libxml2's pull-parser implementation
xmlTextReader
http://xmlsoft.org/html/libxml-xmlreader.html.
This feature requires at least libxml2-2.6.21.
Pull-parsers (such as StAX in Java, or XmlReader in C#) use an iterator
approach to parse XML documents. They are easier to program than
event-based parsers (SAX) and much more lightweight than
tree-based parsers (DOM), which load the complete tree into
memory.
The Reader acts as a cursor going forward on the document
stream and stopping at each node on the way. At every point,
the DOM-like methods of the Reader object allow one to examine the
current node (name, namespace, attributes, etc.)
The user's code keeps control of the progress and simply
calls the read() function repeatedly to
progress to the next node in the document order. Other
functions provide means for skipping complete sub-trees, or
nodes until a specific element, etc.
At any time, only a very limited portion of the
document is kept in memory, which makes the API more
memory-efficient than using DOM. However, it is also possible
to mix Reader with DOM. At any point the user may copy the
current node (optionally expanded into a complete sub-tree)
from the processed document to another DOM tree, or
instruct the Reader to collect a sub-document in the form of a DOM
tree consisting of selected nodes.
The Reader API also supports namespaces, xml:base, entity
handling, and DTD validation. RelaxNG and XML Schema validation
are available through the RelaxNG and Schema reader options
described below.
The naming of methods compared to libxml2 and C#
XmlTextReader has been changed slightly to match the
conventions of XML::LibXML. Some functions have been changed
or added with respect to the C interface.
CONSTRUCTOR
Depending on the XML source, the Reader object can be created with either of:
my $reader = XML::LibXML::Reader->new( location => "file.xml", ... );
my $reader = XML::LibXML::Reader->new( string => $xml_string, ... );
my $reader = XML::LibXML::Reader->new( IO => $file_handle, ... );
my $reader = XML::LibXML::Reader->new( FD => fileno(STDIN), ... );
my $reader = XML::LibXML::Reader->new( DOM => $dom, ... );
where ... are (optional) reader options described below in
Reader options, or various parser options described in XML::LibXML::Parser.
The constructor recognizes the following XML sources:
Source specification
location
Read XML from a local file or (non-HTTPS) URL.
string
Read XML from a string.
IO
Read XML from a Perl IO filehandle.
FD
Read XML from a file descriptor (bypasses Perl I/O
layer, only applicable to filehandles for regular
files or pipes). Possibly faster than IO.
DOM
Use the reader API to walk through a pre-parsed
XML::LibXML::Document.
Reader options
encoding => $encoding
override document encoding.
RelaxNG => $rng_schema
can be used to pass either an XML::LibXML::RelaxNG
object or a filename or (non-HTTPS) URL of a RelaxNG schema to the
constructor. The schema is then used to validate the
document as it is processed.
Schema => $xsd_schema
can be used to pass either an XML::LibXML::Schema
object or a filename or (non-HTTPS) URL of a W3C XSD schema to the
constructor. The schema is then used to validate the
document as it is processed.
...
the reader further supports various
parser options described in
XML::LibXML::Parser (specifically those
labeled by /reader/).
METHODS CONTROLLING PARSING PROGRESS
read ()
Moves the position to the next node in the stream,
exposing its properties.
Returns 1 if the node was read successfully, 0 if
there are no more nodes to read, or -1 in case of
error.
readAttributeValue ()
Parses an attribute value into one or more Text and
EntityReference nodes.
Returns 1 in case of success, 0 if the reader was not positioned on an attribute node or all the attribute values have been read, or -1 in case of error.
readState ()
Gets the read state of the reader. Returns the state
value, or -1 in case of error. The module exports
constants for the Reader states, see STATES
below.
depth ()
The depth of the node in the tree, starts at 0 for
the root node.
next ()
Skip to the node following the current one in the
document order while avoiding the sub-tree if any.
Returns 1 if the node was read successfully, 0 if there
are no more nodes to read, or -1 in case of error.
nextElement (localname?,nsURI?)
Skip nodes following the current one in the document
order until a specific element is reached. The element's
name must be equal to a given localname if defined, and
its namespace must be equal to a given nsURI if defined.
Either of the arguments can be undefined (or omitted, in
case of the latter or both).
Returns 1 if the element was found, 0 if there are no more nodes to read, or -1 in case of error.
nextPatternMatch (compiled_pattern)
Skip nodes following the current one in the document
order until an element matching a given
compiled pattern is reached. See
XML::LibXML::Pattern for information on
compiled patterns. See also the matchesPattern
method.
Returns 1 if the element was found, 0 if there are no more nodes to read, or -1 in case of error.
skipSiblings ()
Skip all nodes on the same or lower level until the
first node on a higher level is reached. In particular,
if the current node occurs in an element, the reader
stops at the end tag of the parent element, otherwise it
stops at a node immediately following the parent
node.
Returns 1 if successful, 0 if end of the document is reached, or -1 in case of error.
nextSibling ()
It skips to the node following the current one in
the document order while avoiding the sub-tree if
any.
Returns 1 if the node was read successfully, 0 if
there are no more nodes to read, or -1 in case of
error.
nextSiblingElement (name?,nsURI?)
Like nextElement but only processes sibling elements
of the current node (moving forward using
nextSibling () rather than
read (), internally).
Returns 1 if the element was found, 0 if there is no
more sibling nodes, or -1 in case of error.
finish ()
Skip all remaining nodes in the document, reaching end of the document.
Returns 1 if successful, 0 in case of error.
close ()
This method releases any resources allocated by the
current instance and closes any underlying input. It
returns 0 on failure and 1 on success. This method is
automatically called by the destructor when the reader
is forgotten, therefore you do not have to call it
directly.
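To make the navigation methods above more concrete, here is a short sketch that jumps from one element to the next with nextElement(). The catalog document and the id attribute are invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

my $xml = <<'XML';
<catalog>
  <item id="a"><name>Alpha</name><note>not needed</note></item>
  <item id="b"><name>Beta</name></item>
</catalog>
XML

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot create reader\n";

my ( @ids, @depths );

# nextElement() returns 1 each time it stops on an <item> start tag,
# and 0 once there are no more nodes to read.
while ( $reader->nextElement('item') == 1 ) {
    push @ids,    $reader->getAttribute('id');
    push @depths, $reader->depth;
}
$reader->close;

print "item ids: @ids\n";
print "depths:   @depths\n";
```

Comparing the return value with 1 explicitly, rather than testing it for truth, keeps the loop from continuing on the error value -1.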
METHODS EXTRACTING INFORMATION
name ()
Returns the qualified name of the current node, equal to (Prefix:)LocalName.
nodeType ()
Returns the type of the current node. See NODE TYPES below.
localName ()
Returns the local name of the node.
prefix ()
Returns the prefix of the namespace associated with the node.
namespaceURI ()
Returns the URI defining the namespace associated with the node.
isEmptyElement ()
Check if the current node is empty, this is a bit
bizarre in the sense that <a/> will be considered
empty while <a></a> will not.
hasValue ()
Returns true if the node can have a text value.
value ()
Provides the text value of the node if present or undef if not available.
readInnerXml ()
Reads the contents of the current node, including
child nodes and markup. Returns a string containing the
XML of the node's content, or undef if the current node
is neither an element nor attribute, or has no child
nodes.
readOuterXml ()
Reads the contents of the current node, including
child nodes and markup.
Returns a string containing the XML of the node
including its content, or undef if the current node is
neither an element nor attribute.
nodePath()
Returns a canonical location path to the current element
from the root node to the current node. Namespaced
elements are matched by '*', because there is no way to declare
prefixes within XPath patterns. Unlike
XML::LibXML::Node::nodePath(), this function
does not provide sibling counts (i.e. instead of e.g. '/a/b[1]' and '/a/b[2]'
you get '/a/b' for both matches).
matchesPattern(compiled_pattern)
Returns a true value if the current
node matches a compiled pattern.
See XML::LibXML::Pattern for information on
compiled patterns. See also the nextPatternMatch
method.
METHODS EXTRACTING DOM NODES
document ()
Provides access to the document tree built by the
reader. This function can be used to collect the
preserved nodes (see preserveNode()
and preservePattern).
CAUTION: Never use this function to modify the tree
unless reading of the whole document is
completed!
copyCurrentNode (deep)
This function is similar to the DOM function
copyNode(). It returns a copy of the
currently processed node as a corresponding DOM object.
Use deep = 1 to obtain the full sub-tree.
preserveNode ()
This tells the XML Reader to preserve the current
node in the document tree. A document tree consisting of
the preserved nodes and their content can be obtained
using the method document() once
parsing is finished.
Returns the node or NULL in case of error.
preservePattern (pattern,\%ns_map)
This tells the XML Reader to preserve all nodes
matched by the pattern (which is a streaming XPath
subset). A document tree consisting of the preserved
nodes and their content can be obtained using the method
document() once parsing is
finished.
An optional second argument can be used to provide a
HASH reference mapping prefixes used by the XPath to
namespace URIs.
The XPath subset available with this function is
described at
http://www.w3.org/TR/xmlschema-1/#Selector
and matches the production
Path ::= ('.//')? ( Step '/' )* ( Step | '@' NameTest )
Returns a positive number in case of success and -1
in case of error.
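Mixing the Reader with DOM, as described above, might look like the following sketch: the reader streams through the document and only the current row is expanded into a DOM sub-tree. The table layout is an assumption made for the example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

my $xml = <<'XML';
<table>
  <row><cell>1</cell><cell>one</cell></row>
  <row><cell>2</cell><cell>two</cell></row>
</table>
XML

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot create reader\n";

my @labels;
while ( $reader->nextElement('row') == 1 ) {
    # deep = 1 copies the whole <row> sub-tree as a DOM element,
    # so ordinary DOM and XPath methods can be used on the copy.
    my $row = $reader->copyCurrentNode(1);
    push @labels, $row->findvalue('cell[2]');
}

print "second cells: @labels\n";
```

Each copy is independent of the reader, so it can be kept, modified, or discarded without affecting further parsing.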
METHODS PROCESSING ATTRIBUTES
attributeCount ()
Provides the number of attributes of the current
node.
hasAttributes ()
Whether the node has attributes.
getAttribute (name)
Provides the value of the attribute with the
specified qualified name.
Returns a string containing the value of the
specified attribute, or undef in case of error.
getAttributeNs (localName, namespaceURI)
Provides the value of the specified
attribute.
Returns a string containing the value of the
specified attribute, or undef in case of error.
getAttributeNo (no)
Provides the value of the attribute with the
specified index relative to the containing
element.
Returns a string containing the value of the
specified attribute, or undef in case of error.
isDefault ()
Returns true if the current attribute node was
generated from the default value defined in the
DTD.
moveToAttribute (name)
Moves the position to the attribute with the
specified qualified name.
Returns 1 in case of success, -1 in case of error, 0
if not found
moveToAttributeNo (no)
Moves the position to the attribute with the
specified index relative to the containing
element.
Returns 1 in case of success, -1 in case of error, 0
if not found
moveToAttributeNs (localName,namespaceURI)
Moves the position to the attribute with the
specified local name and namespace URI.
Returns 1 in case of success, -1 in case of error, 0
if not found
moveToFirstAttribute ()
Moves the position to the first attribute associated
with the current node.
Returns 1 in case of success, -1 in case of error, 0
if not found
moveToNextAttribute ()
Moves the position to the next attribute associated
with the current node.
Returns 1 in case of success, -1 in case of error, 0
if not found
moveToElement ()
Moves the position to the node that contains the
current attribute node.
Returns 1 in case of success, -1 in case of error, 0
if not moved
isNamespaceDecl ()
Determine whether the current node is a namespace
declaration rather than a regular attribute.
Returns 1 if the current node is a namespace
declaration, 0 if it is a regular attribute or other
type of node, or -1 in case of error.
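The attribute methods above are typically combined into a loop such as the one sketched below. The img element and its attributes are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

my $reader = XML::LibXML::Reader->new(
    string => '<img src="a.png" alt="An image" width="10"/>' )
    or die "cannot create reader\n";

$reader->read == 1 or die "no node to read\n";

my %attr;
if ( $reader->moveToFirstAttribute == 1 ) {
    # Visit every attribute of the current element in turn.
    do {
        $attr{ $reader->name } = $reader->value;
    } while ( $reader->moveToNextAttribute == 1 );

    # Step back from the last attribute to the owning element.
    $reader->moveToElement;
}

my $count     = $reader->attributeCount;
my $node_name = $reader->name;

print "$node_name has $count attributes\n";
print "  $_ = $attr{$_}\n" for sort keys %attr;
```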
OTHER METHODS
lookupNamespace (prefix)
Resolves a namespace prefix in the scope of the
current element.
Returns a string containing the namespace URI to
which the prefix maps or undef in case of error.
encoding ()
Returns a string containing the encoding of the
document or undef in case of error.
standalone ()
Determine the standalone status of the document
being read. Returns 1 if the document was declared to be
standalone, 0 if it was declared to be not standalone,
or -1 if the document did not specify its standalone
status or in case of error.
xmlVersion ()
Determine the XML version of the document being
read. Returns a string containing the XML version of the
document or undef in case of error.
baseURI ()
Returns the base URI of a given node.
isValid ()
Retrieve the validity status from the parser.
Returns 1 if valid, 0 if not, and -1 in case of
error.
xmlLang ()
The xml:lang scope within which the node
resides.
lineNumber ()
Provide the line number of the current parsing
point.
columnNumber ()
Provide the column number of the current parsing
point.
byteConsumed ()
This function provides the current index of the
parser relative to the start of the current entity. This
function is computed in bytes from the beginning
starting at zero and finishing at the size in bytes of
the file if parsing a file. The function is of constant
cost if the input is UTF-8 but can be costly if run on
non-UTF-8 input.
setParserProp (prop => value, ...)
Change the parser processing behaviour by changing
some of its internal properties. The following
properties are available with this function:
"load_ext_dtd", "complete_attributes",
"validation", "expand_entities".
Since some of the properties can only be changed
before any read has been done, it is best to set the
parsing properties at the constructor.
Returns 0 if the call was successful, or -1 in case
of error
getParserProp (prop)
Get the value of a parser-internal property. The
following property names can be used: "load_ext_dtd",
"complete_attributes", "validation",
"expand_entities".
Returns the value, usually 0 or 1, or -1 in case of
error.
DESTRUCTION
XML::LibXML takes care of the reader object destruction
when the last reference to the reader object goes out of
scope. The document tree is preserved, though, if either of
$reader->document or $reader->preserveNode was used and
references to the document tree exist.
NODE TYPES
The reader interface provides the following constants for
node types (the constant symbols are exported by default or if
tag :types is used).
XML_READER_TYPE_NONE => 0
XML_READER_TYPE_ELEMENT => 1
XML_READER_TYPE_ATTRIBUTE => 2
XML_READER_TYPE_TEXT => 3
XML_READER_TYPE_CDATA => 4
XML_READER_TYPE_ENTITY_REFERENCE => 5
XML_READER_TYPE_ENTITY => 6
XML_READER_TYPE_PROCESSING_INSTRUCTION => 7
XML_READER_TYPE_COMMENT => 8
XML_READER_TYPE_DOCUMENT => 9
XML_READER_TYPE_DOCUMENT_TYPE => 10
XML_READER_TYPE_DOCUMENT_FRAGMENT => 11
XML_READER_TYPE_NOTATION => 12
XML_READER_TYPE_WHITESPACE => 13
XML_READER_TYPE_SIGNIFICANT_WHITESPACE => 14
XML_READER_TYPE_END_ELEMENT => 15
XML_READER_TYPE_END_ENTITY => 16
XML_READER_TYPE_XML_DECLARATION => 17
STATES
The following constants represent the values returned by
readState(). They are exported by default,
or if tag :states is used:
XML_READER_NONE => -1
XML_READER_START => 0
XML_READER_ELEMENT => 1
XML_READER_END => 2
XML_READER_EMPTY => 3
XML_READER_BACKTRACK => 4
XML_READER_DONE => 5
XML_READER_ERROR => 6
SEE ALSO
XML::LibXML::Pattern for information about compiled patterns.
http://xmlsoft.org/html/libxml-xmlreader.html
http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html
ORIGINAL IMPLEMENTATION
Heiko Klein, <H.Klein@gmx.net> and Petr Pajas
XML::LibXML::XPathExpression - interface to libxml2 pre-compiled XPath expressions
XML::LibXML::XPathExpression
Synopsis
use XML::LibXML;
my $compiled_xpath = XML::LibXML::XPathExpression->new('//foo[@bar="baz"][position()<4]');
# interface from XML::LibXML::Node
my $result = $node->find($compiled_xpath);
my @nodes = $node->findnodes($compiled_xpath);
my $value = $node->findvalue($compiled_xpath);
# interface from XML::LibXML::XPathContext
my $result = $xpc->find($compiled_xpath,$node);
my @nodes = $xpc->findnodes($compiled_xpath,$node);
my $value = $xpc->findvalue($compiled_xpath,$node);
Description
This is a perl interface to libxml2's pre-compiled XPath expressions.
Pre-compiling an XPath expression can give some performance
benefit if the same XPath query is evaluated many times.
XML::LibXML::XPathExpression objects
can be passed to all find...
functions in XML::LibXML
that expect an XPath expression.
new()
$compiled = XML::LibXML::XPathExpression->new( xpath_string );
The constructor takes an XPath 1.0 expression as a string
and returns an object representing the pre-compiled
expression (the actual data structure is internal to libxml2).
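The sketch below shows the intended usage pattern: compile once, then evaluate the same expression against several context nodes. The groups document is an assumption made for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new->parse_string(<<'XML');
<groups>
  <group><foo bar="baz"/><foo bar="qux"/></group>
  <group><foo bar="baz"/><foo bar="baz"/></group>
</groups>
XML

# Compile the expression once ...
my $compiled = XML::LibXML::XPathExpression->new('count(foo[@bar="baz"])');

# ... and reuse it for every <group> context node.
my @counts = map { $_->findvalue($compiled) } $doc->findnodes('/groups/group');

print "matching foo elements per group: @counts\n";
```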
XML::LibXML::Pattern - interface to libxml2 XPath patterns
XML::LibXML::Pattern
Synopsis
use XML::LibXML;
my $pattern = XML::LibXML::Pattern->new('/x:html/x:body//x:div', { 'x' => 'http://www.w3.org/1999/xhtml' });
# test a match on an XML::LibXML::Node $node
if ($pattern->matchesNode($node)) { ... }
# or on an XML::LibXML::Reader
if ($reader->matchesPattern($pattern)) { ... }
# or skip reading all nodes that do not match
print $reader->nodePath while $reader->nextPatternMatch($pattern);
Description
This is a perl interface to libxml2's pattern matching support
http://xmlsoft.org/html/libxml-pattern.html.
This feature requires recent versions of libxml2.
Patterns are a small subset of XPath language, which is limited
to (disjunctions of) location paths involving the child and descendant axes in abbreviated form
as described by the extended BNF given below:
Selector ::= Path ( '|' Path )*
Path ::= ('.//' | '//' | '/' )? Step ( '/' Step )*
Step ::= '.' | NameTest
NameTest ::= QName | '*' | NCName ':' '*'
For readability, whitespace may be used in selector XPath expressions even though not explicitly allowed by the grammar: whitespace may be freely added within patterns before or after any token, where
token ::= '.' | '/' | '//' | '|' | NameTest
Note that no predicates or attribute tests are allowed.
Patterns are particularly useful for stream parsing provided via the XML::LibXML::Reader interface.
new()
$pattern = XML::LibXML::Pattern->new( pattern, { prefix => namespace_URI, ... } );
The constructor of a pattern takes a pattern expression (as described
by the BNF grammar above) and an optional HASH reference mapping
prefixes to namespace URIs. The method returns a compiled pattern object.
Note that if the document
has a default namespace, it must still be given a prefix in order
to be matched (as demanded by the XPath 1.0 specification). For example,
to match an element <a xmlns="http://foo.bar"></a>, one
should use a pattern like this:
$pattern = XML::LibXML::Pattern->new( 'foo:a', { foo => 'http://foo.bar' });
matchesNode($node)
$bool = $pattern->matchesNode($node);
Given an XML::LibXML::Node object, returns a true value if
the node is matched by the compiled pattern expression.
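The default-namespace rule above can be checked with a short sketch: the same element is tested against a prefixed and an unprefixed pattern. The prefix x is arbitrary, and the document is made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new->parse_string(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><div/></body></html>');

my ($div) = $doc->getElementsByLocalName('div');

# The document uses a default namespace, so the pattern needs a prefix
# mapped to the same URI.
my $prefixed = XML::LibXML::Pattern->new( 'x:div',
    { x => 'http://www.w3.org/1999/xhtml' } );

# Without a prefix the name test refers to no namespace at all.
my $unprefixed = XML::LibXML::Pattern->new('div');

my $with_prefix    = $prefixed->matchesNode($div)   ? 1 : 0;
my $without_prefix = $unprefixed->matchesNode($div) ? 1 : 0;

print "x:div matches: $with_prefix\n";
print "div matches:   $without_prefix\n";
```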
SEE ALSO
XML::LibXML::Reader for other methods involving compiled patterns.
XML::LibXML::RegExp - interface to libxml2 regular expressions
XML::LibXML::RegExp
Synopsis
use XML::LibXML;
my $compiled_re = XML::LibXML::RegExp->new('[0-9]{5}(-[0-9]{4})?');
if ($compiled_re->isDeterministic()) { ... }
if ($compiled_re->matches($string)) { ... }
Description
This is a perl interface to libxml2's implementation of regular expressions, which are used e.g. for validation of XML Schema simple types (pattern facet).
new()
$compiled_re = XML::LibXML::RegExp->new( $regexp_str );
The constructor takes a string containing a regular expression
and returns a compiled regexp object.
matches($string)
$bool = $compiled_re->matches($string);
Given a string value, returns a true value if
the value is matched by the compiled regular expression.
isDeterministic()
$bool = $compiled_re->isDeterministic();
Returns a true value if the regular expression is deterministic; returns false otherwise. (See the definition of determinism in the XML spec.)
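A brief sketch of matches() follows, using the ZIP-code expression from the synopsis. The sample strings are assumptions; the comment about whole-string matching reflects XML Schema pattern semantics, in which there are no anchors and the expression must cover the entire value.

```perl
use strict;
use warnings;
use XML::LibXML;

my $zip_re = XML::LibXML::RegExp->new('[0-9]{5}(-[0-9]{4})?');

# Unlike a Perl regex, the expression has to match the whole string,
# so a partial match inside a longer value does not count.
my @samples = ( '12345', '12345-6789', '1234', 'x12345' );
my @results = map { $zip_re->matches($_) ? 1 : 0 } @samples;

for my $i ( 0 .. $#samples ) {
    printf "%-12s %s\n", $samples[$i], $results[$i] ? "matches" : "no match";
}
```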
A map for named nodes
XML::LibXML::NamedNodeMap
Synopsis
use XML::LibXML;
my $map = XML::LibXML::NamedNodeMap->new(@nodes);
my $nodes_list = $map->nodes();
my $node_with_index_2 = $map->item(2);
Description
XML::LibXML::NamedNodeMap maps nodes' names to nodes.
Methods
length
my $length = $map->length;
Returns the number of nodes in the map.
nodes
my $nodes_ref = $map->nodes()
Returns a reference to the list of nodes.
item
my $node_2 = $map->item(2);
Returns the node with the index of the argument
(starting from 0)
getNamedItem
my $node = $map->getNamedItem('phone_number');
Returns the node with the name.
setNamedItem
$map->setNamedItem($new_node)
Sets the node with the same name as
$new_node to
$new_node.
removeNamedItem
$map->removeNamedItem($name)
Remove the item with the name
$name.
getNamedItemNS
Not implemented yet.
setNamedItemNS
Not implemented yet.
removeNamedItemNS
Not implemented yet.
Structured Errors
XML::LibXML::Error
Synopsis
eval { ... };
if (ref($@)) {
# handle a structured error (XML::LibXML::Error object)
} elsif ($@) {
# error, but not an XML::LibXML::Error object
} else {
# no error
}
Description
The
XML::LibXML::Error class is a tiny frontend to
libxml2's structured error support. If
XML::LibXML is compiled with structured error support, all errors
reported by libxml2 are transformed to XML::LibXML::Error
objects. These objects automatically serialize to the
corresponding error messages when printed or used in a string
operation, but as objects they can also be used to get detailed,
structured information about the error that occurred.
Unlike most other XML::LibXML objects, XML::LibXML::Error
doesn't wrap an underlying libxml2
structure directly, but rather transforms it to a blessed Perl
hash reference containing the individual fields of the
structured error information as hash key-value pairs. Individual
items (fields) of a structured error can either be
obtained directly as $@->{field}, or using autoloaded
methods such as $@->field() (where field is the field
name). XML::LibXML::Error objects have the following fields:
domain, code, level, file, line, nodename, message, str1, str2,
str3, num1, num2, and _prev (some of them may be undefined).
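A short sketch of how these fields might be inspected after a failed parse. The malformed input string is made up for the example, and the sketch assumes XML::LibXML was built with structured error support.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);
use XML::LibXML;

# Deliberately malformed input: <b> is never closed.
my $doc = eval { XML::LibXML->load_xml( string => '<a><b></a>' ) };
my $err = $@;    # copy $@ right away, before anything can reset it

my $report = '';
if ( blessed($err) && $err->isa('XML::LibXML::Error') ) {
    # Fields can be read through methods or as plain hash entries.
    $report = sprintf "domain=%s level=%s line=%s\n",
        $err->domain  // 'n/a',
        $err->{level} // 'n/a',
        $err->line    // 'n/a';

    # Using the object in a string serializes it to the error message.
    $report .= "message: $err";
}
elsif ($err) {
    $report = "plain error: $err";
}
print $report;
```

Copying $@ into a lexical variable first is a general Perl precaution; later eval blocks or destructors can otherwise overwrite it.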
$XML::LibXML::Error::WARNINGS
$XML::LibXML::Error::WARNINGS=1;
Traditionally, XML::LibXML suppressed parser
warnings by setting libxml2's global variable
xmlGetWarningsDefaultValue to 0. Since
1.70 it no longer changes libxml2's global
variables; for backward compatibility,
XML::LibXML still suppresses warnings by default.
This variable can be set to 1
to report these warnings via
Perl's warn,
and to 2 to report them via die.
as_string
$message = $@->as_string();
This function serializes an XML::LibXML::Error
object to a string containing the full error message
close to the message produced by libxml2 default error
handlers and tools like xmllint. This method is also used
to overload the "" operator on XML::LibXML::Error, so it is
automatically called whenever an XML::LibXML::Error object
is treated as a string (e.g. in print $@).
dump
print $@->dump();
This function serializes an XML::LibXML::Error to a
string displaying all fields of the error structure
individually on separate lines of the form 'name' => 'value'.
domain
$error_domain = $@->domain();
Returns a string indicating which part of the
library raised the error. Can be one of:
"parser", "tree", "namespace", "validity", "HTML parser",
"memory", "output", "I/O", "ftp", "http",
"XInclude", "XPath", "xpointer", "regexp", "Schemas datatype",
"Schemas parser", "Schemas validity",
"Relax-NG parser", "Relax-NG validity", "Catalog",
"C14N", "XSLT", "validity".
code
$error_code = $@->code();
Returns the actual libxml2 error code.
The XML::LibXML::ErrNo module defines
constants for individual error codes. Currently
libxml2 uses over 480 different error codes.
message
$error_message = $@->message();
Returns a human-readable informative error
message.
level
$error_level = $@->level();
Returns an integer value describing the severity of
the error. XML::LibXML::Error defines the following
constants:
XML_ERR_NONE = 0
XML_ERR_WARNING = 1 : A simple warning.
XML_ERR_ERROR = 2 : A recoverable error.
XML_ERR_FATAL = 3 : A fatal error.
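As an illustration, the numeric level can be turned into a readable label. The lookup table below simply copies the values listed above; the helper level_label() and the sample input are hypothetical and not part of XML::LibXML.

```perl
use strict;
use warnings;
use XML::LibXML;

# Labels keyed by the numeric level values listed above.
my %LABEL_FOR = (
    0 => 'none',
    1 => 'warning',
    2 => 'recoverable error',
    3 => 'fatal error',
);

# Hypothetical helper: turn a level number into a readable label.
sub level_label {
    my ($level) = @_;
    return 'unknown level' unless defined $level;
    return $LABEL_FOR{$level} // "unknown level $level";
}

# Deliberately malformed input to provoke an error.
eval { XML::LibXML->load_xml( string => '<a><b></a>' ) };
my $err = $@;

my $label = ref $err ? level_label( $err->level ) : 'not a structured error';
print "severity: $label\n";
```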
file
$filename = $@->file();
Returns the name of the file that was being processed when
the error occurred.
line
$line = $@->line();
The line number, if available.
nodename
$nodename = $@->nodename();
Name of the node where error occurred, if available.
When this field is non-empty, libxml2 actually returned a
physical pointer to the specified node. Due to memory
management issues, it is very difficult to implement a
way to expose the pointer to the Perl level as an
XML::LibXML::Node. For this reason, XML::LibXML::Error
currently only exposes the name of the node.
str1
$error_str1 = $@->str1();
Error specific. Extra string information.
str2
$error_str2 = $@->str2();
Error specific. Extra string information.
str3
$error_str3 = $@->str3();
Error specific. Extra string information.
num1
$error_num1 = $@->num1();
Error specific. Extra numeric information.
num2
$error_num2 = $@->num2();
In recent libxml2 versions, this
value contains the column number of the error, or 0 if not available.
context
$string = $@->context();
For parsing errors, this field contains
about 80 characters of the XML near the place
where the error occurred. The field
$@->column()
contains the corresponding offset.
Where N/A, the field is undefined.
column
$offset = $@->column();
See $@->context() above.
_prev
$previous_error = $@->_prev();
This field can possibly hold a reference to another
XML::LibXML::Error object representing an error which
occurred just before this error.
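Because each error may point to the one before it, the errors form a linked list that can be walked from the most recent error backwards. A minimal sketch, using a made-up malformed document:

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);
use XML::LibXML;

# Deliberately malformed input to provoke one or more errors.
eval { XML::LibXML->load_xml( string => '<a><b></a>' ) };
my $err = $@;

# Follow _prev until the chain ends (most recent error first).
my @messages;
for (
    my $e = $err ;
    blessed($e) && $e->isa('XML::LibXML::Error') ;
    $e = $e->_prev
  )
{
    my $message = $e->message // '';
    chomp $message;
    push @messages, $message;
}

# Reverse the list to print the errors in the order they occurred.
print "$_\n" for reverse @messages;
```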
Structured Errors
XML::LibXML::ErrNo
Description
This module is based on the xmlerror.h C header file of libxml2.
It defines symbolic constants for all libxml2 error codes.
Currently libxml2 uses over 480 different error codes.
See also XML::LibXML::Error.
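The sketch below compares error codes against one of these constants. It assumes that the constants are named after their xmlerror.h counterparts with the XML_ prefix dropped (so XML_ERR_TAG_NAME_MISMATCH becomes ERR_TAG_NAME_MISMATCH) and that they are not exported, hence the fully qualified call. Check ErrNo.pm in your installation before relying on a particular name.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);
use XML::LibXML;
use XML::LibXML::ErrNo;

# Deliberately malformed input: the end tag does not match the open tag.
eval { XML::LibXML->load_xml( string => '<a><b></a>' ) };
my $err = $@;

# Look through the whole error chain for a tag-name mismatch.
# The constant name is an assumption, see the note above.
my $found_mismatch = 0;
for (
    my $e = $err ;
    blessed($e) && $e->isa('XML::LibXML::Error') ;
    $e = $e->_prev
  )
{
    $found_mismatch = 1
        if $e->code == XML::LibXML::ErrNo::ERR_TAG_NAME_MISMATCH();
}

print $found_mismatch
    ? "tag name mismatch reported\n"
    : "no tag name mismatch in the error chain\n";
```

Comparing against a symbolic constant keeps the check readable and avoids hard-coding numeric codes in application code.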
Constants and Character Encoding Routines
XML::LibXML::Common
Synopsis
use XML::LibXML::Common;
Description
XML::LibXML::Common defines constants for all node types
and provides an interface to libxml2's charset conversion
functions.
Since XML::LibXML uses its own node type definitions,
one may want to use XML::LibXML::Common in its compatibility
mode:
Exporter TAGS
use XML::LibXML::Common qw(:libxml);
:libxml tag will use the XML::LibXML Compatibility mode, which defines the
old 'XML_' node-type definitions.
use XML::LibXML::Common qw(:gdome);
:gdome tag will use the XML::GDOME Compatibility mode, which defines the
old 'GDOME_' node-type definitions.
use XML::LibXML::Common qw(:w3c);
This uses the nodetype definition names as specified for DOM.
use XML::LibXML::Common qw(:encoding);
This tag can be used to export only the charset encoding functions of XML::LibXML::Common.
Exports
By default, XML::LibXML::Common exports the W3C definitions as given in the
DOM specifications, together with the encoding functions.
Encoding functions
To encode or decode a string to or from UTF-8, XML::LibXML::Common exports
two functions, which provide an interface to the encoding support in libxml2.
Which encodings are supported by these functions depends
on how libxml2 was compiled. UTF-16 is
always supported and on most installations, ISO encodings are
supported as well.
This interface was useful for older versions of Perl.
Since Perl 5.8 and later provide similar functions via the Encode module,
it is probably a good idea to use those instead.
encodeToUTF8
$encodedstring = encodeToUTF8( $name_of_encoding, $string_to_encode );
The function converts a byte string from the specified encoding to a UTF-8 encoded character string.
decodeFromUTF8
$decodedstring = decodeFromUTF8($name_of_encoding, $string_to_decode );
This function converts a UTF-8 encoded character string to a specified
encoding. Note that the conversion can raise an error if the
given string contains characters that cannot be represented in the target encoding.
Both functions report their errors on standard
error and croak() when an error occurs. To catch
the error and keep an encoding failure from stopping the
entire script, call the encoding function
from within an eval block.
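A minimal sketch of the eval pattern described above, converting a Latin-1 byte string to UTF-8 and back. The sample string is made up, and ISO-8859-1 is assumed to be available in the local libxml2 build.

```perl
use strict;
use warnings;
use XML::LibXML::Common qw(:encoding);

# "cafe" with an acute accent on the e, as ISO-8859-1 bytes.
my $latin1_bytes = "caf\xE9";

# Wrap each call in eval so that a croak() does not stop the script.
my $chars = eval { encodeToUTF8( 'iso-8859-1', $latin1_bytes ) };
warn "encodeToUTF8 failed: $@" unless defined $chars;

my $round_trip;
if ( defined $chars ) {
    $round_trip = eval { decodeFromUTF8( 'iso-8859-1', $chars ) };
    warn "decodeFromUTF8 failed: $@" unless defined $round_trip;
}

if ( defined $round_trip ) {
    print $round_trip eq $latin1_bytes
        ? "round trip preserved the bytes\n"
        : "round trip changed the data\n";
}
```

With Perl 5.8 or later, Encode::decode and Encode::encode cover the same ground and are the more common choice.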
A note on history
Before XML::LibXML 1.70, this class was available as a
separate CPAN distribution, intended to provide functionality
shared between XML::LibXML, XML::GDOME, and possibly other
modules. Since there seems to be no progress in this
direction, we decided to merge XML::LibXML::Common 0.13 and
XML::LibXML 1.70 to one CPAN distribution.
The merge also naturally eliminates a practical and
urgent problem experienced by many XML::LibXML users on certain
platforms, namely mysterious misbehavior of XML::LibXML
occurring if the installed (often pre-packaged) version of
XML::LibXML::Common was compiled against an older version of
libxml2 than XML::LibXML.