XML::LibXML

Matt Sergeant, Christian Glahn, Petr Pajas

2.0207

2001-2007 AxKit.com Ltd
2002-2006 Christian Glahn
2006-2009 Petr Pajas

Introduction

README

This module implements a Perl interface to the Gnome libxml2 library, which provides interfaces for parsing and manipulating XML files. This module allows Perl programmers to make use of its highly capable validating XML parser and its high performance DOM implementation.

Important Notes

XML::LibXML was almost entirely reimplemented between version 1.40 and version 1.49. This may cause problems on some production machines. With version 1.50 a lot of compatibility fixes were applied, so programs written for XML::LibXML 1.40 or earlier should run with version 1.50 again.

In 1.59, a new callback API was introduced. This new API is not compatible with the previous one. See the XML::LibXML::InputCallback manual page for details.

In 1.61 the XML::LibXML::XPathContext module, previously distributed separately, was merged in.

The experimental support for Perl threads introduced in 1.66 was replaced in 1.67.

Dependencies

Prior to installation you MUST have installed the libxml2 library. You can get the latest libxml2 version from http://xmlsoft.org/

Without libxml2 installed this module will neither build nor run.

XML::LibXML also requires the following packages:

    XML::SAX - base class for SAX parsers
    XML::NamespaceSupport - namespace support for SAX parsers

These packages are required. If one is missing, some tests will fail.

Again, libxml2 is required to make XML::LibXML work. The library is not just required to build XML::LibXML; it has to be accessible at run-time as well. Because of this you need to make sure libxml2 is installed properly. To test this, run the xmllint program on your system. xmllint is shipped with libxml2 and therefore should be available. For building the module you will also need the header file for libxml2, which in binary (.rpm,.deb) etc.
distributions usually live in a package named libxml2-devel or similar.

Installation

(These instructions are for UNIX and GNU/Linux systems. For MSWin32, see Notes for Microsoft Windows below.)

To install XML::LibXML just follow the standard installation routine for Perl modules:

    perl Makefile.PL
    make
    make test
    make install # as superuser

Note that XML::LibXML is an XS based Perl extension and you need a C compiler to build it.

Note also that you should rebuild XML::LibXML if you upgrade libxml2, in order to avoid problems with possible binary incompatibilities between releases of the library.

Notes on libxml2 versions

XML::LibXML requires at least libxml2 2.6.16 to compile and pass all tests, and at least 2.6.21 is required for XML::LibXML::Reader. For some older OS versions this means that an update of the pre-built packages is required.

Although libxml2 claims binary compatibility between its patch levels, it is a good idea to recompile XML::LibXML and run its tests after an upgrade of libxml2.

If your libxml2 installation is not within your $PATH, you can pass the XMLPREFIX=$YOURLIBXMLPREFIX parameter to Makefile.PL so that it can determine the correct libxml2 version in use. For example,

    perl Makefile.PL XMLPREFIX=/usr/brand-new

will ask '/usr/brand-new/bin/xml2-config' about your real libxml2 configuration.

Try to avoid setting INC and LIBS directly on the command line: if they are used, Makefile.PL does not check the libxml2 version for compatibility with XML::LibXML.

Which version of libxml2 should be used?

XML::LibXML is tested against several versions of libxml2 before it is released. As a result, some versions of libxml2 are known not to work properly with XML::LibXML. The Makefile.PL keeps a blacklist of the incompatible libxml2 versions using Alien::Libxml2. The blacklist itself is kept inside its "alienfile" file. If Makefile.PL detects one of the incompatible versions, it notifies the user.
It may still happen that XML::LibXML builds and passes its tests with such a version, but that does not mean everything is OK. There will be no support at all for blacklisted versions!

As of XML::LibXML 1.61, only versions 2.6.16 and higher are supported. XML::LibXML will probably not compile with libxml2 versions earlier than 2.5.6. Versions prior to 2.6.8 are known to be broken for various reasons; versions prior to 2.6.16 exhibit problems with namespaced attributes and therefore do not pass the XML::LibXML regression tests.

It may happen that an unsupported version of libxml2 passes all tests under certain conditions. This is no reason to assume that it will work without problems. If Makefile.PL marks a version of libxml2 as incompatible or broken, it is done for a good reason.

Full linking information for libxml2 can be obtained by invoking "xml2-config --libs".

Notes for Microsoft Windows

Thanks to Randy Kobes there is a pre-compiled PPM package available on http://theoryx5.uwinnipeg.ca/ppmpackages/

Usually it takes a little time to build the package for the latest release.

If you want to build XML::LibXML on Windows from source, you can use the following instructions contributed by Christopher J. Madsen:

These instructions assume that you already have your system set up to compile modules that use C components.

First, get the libxml2 binaries from http://xmlsoft.org/sources/win32/ (currently also available at http://www.zlatkovic.com/pub/libxml/).

You need:

    iconv-VERSION.win32.zip
    libxml2-VERSION.win32.zip
    zlib-VERSION.win32.zip

Download the latest version of each. (Each package will probably have a different version.) When you extract them, you'll get directories named iconv-VERSION.win32, libxml2-VERSION.win32, and zlib-VERSION.win32, each containing bin, lib, and include directories.

Combine all the bin, include, and lib directories under c:\Prog\LibXML. (You can use any directory you prefer; just adjust the instructions accordingly.)
Get the latest version of XML-LibXML from CPAN. Extract it.

Issue these commands in the XML-LibXML-VERSION directory:

    perl Makefile.PL INC=-Ic:\Prog\LibXML\include LIBS=-Lc:\Prog\LibXML\lib
    nmake
    copy c:\Prog\LibXML\bin\*.dll blib\arch\auto\XML\LibXML
    nmake test
    nmake install

(Note: Some systems use dmake instead of nmake.)

By copying the libxml2 DLLs to the arch directory, you help avoid conflicts with other programs you may have installed that use other (possibly incompatible) versions of those DLLs.

Notes for Mac OS X

Due to a refactoring of the module, XML::LibXML will not run with some earlier versions of Mac OS X. It appears that this is related to special linker options for that OS prior to version 10.2.2. Since the developers do not have full access to this OS, help and patches from OS X gurus are highly appreciated.

It is confirmed that XML::LibXML builds and runs without problems on Mac OS X 10.2.6 and later.

Notes for HPUX

XML::LibXML requires libxml2 2.6.16 or later. A usable binary libxml2 package for HPUX and XML::LibXML may not exist.

If HPUX cc does not compile libxml2 correctly, you will be forced to recompile perl with gcc (unless you have already done that).
Additionally I received the following note from Rozi Kovesdi:

    Here is my report if someone else runs into the same problem:

    Finally I am done with installing all the libraries and XML Perl modules.

    The combination that worked best for me was:

        gcc
        GNU make

    Most importantly - before trying to install Perl modules that depend on libxml2:

    must set SHLIB_PATH to include the path to libxml2 shared library

    assuming that you used the default:

        export SHLIB_PATH=/usr/local/lib

    also, make sure that the config files have execute permission:

        /usr/local/bin/xml2-config
        /usr/local/bin/xslt-config

    they did not have +x after they were installed by 'make install' and it took me a while to realize that this was my problem

    or one can use:

        perl Makefile.PL LIBS='-L/path/to/lib' INC='-I/path/to/include'

Contact

For bug reports, please use the issue tracker at https://github.com/shlomif/perl-XML-LibXML/issues .

For suggestions etc. you may contact the maintainer directly at https://www.shlomifish.org/me/contact-me/ , but in general, it is recommended to use the mailing list given below.

For suggestions etc., and other issues related to XML::LibXML, you may use the perl XML mailing list (perl-xml@listserv.ActiveState.com), where most XML-related Perl modules are discussed. In case of problems you should check the archives of that list first. Many problems are already discussed there. You can find the list's archives and subscription options at http://aspn.activestate.com/ASPN/Mail/Browse/Threaded/perl-xml

Package History

    Versions < 0.98 were maintained by Matt Sergeant
    Versions >= 0.98 and < 1.49 were maintained by Matt Sergeant and Christian Glahn
    Versions >= 1.49 are maintained by Christian Glahn
    Versions > 1.56 are co-maintained by Petr Pajas
    Versions >= 1.59 are provisionally maintained by Petr Pajas

Patches and Developer Version

As XML::LibXML is open source software, help and patches are appreciated.
If you find a bug in the current release, make sure this bug still exists in the developer version of XML::LibXML. This version can be cloned from its Git repository. For more information about that, see: https://github.com/shlomif/perl-XML-LibXML

Please consider all regression tests as correct. If any test fails, it is most certainly related to a bug.

If you find documentation bugs, please fix them in the libxml.dbk file, stored in the docs directory.

Known Issues

The push-parser implementation causes memory leaks.

License

LICENSE

This is free software; you may use it and distribute it under the same terms as Perl itself.

Copyright 2001-2007 AxKit.com Ltd., 2002-2006 Christian Glahn, 2006-2009 Petr Pajas

Disclaimer

THIS PROGRAM IS DISTRIBUTED IN THE HOPE THAT IT WILL BE USEFUL, BUT WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Perl Binding for libxml2

XML::LibXML

Synopsis

    use XML::LibXML;
    my $dom = XML::LibXML->load_xml(string => <<'EOT');
    <some-xml/>
    EOT

Description

This module is an interface to libxml2, providing XML and HTML parsers with DOM, SAX and XMLReader interfaces, a large subset of the DOM Layer 3 interface, and an XML::XPath-like interface to the XPath API of libxml2. The module is split into several packages which are not described in this section; unless stated otherwise, you only need to use XML::LibXML; in your programs.

Check out XML::LibXML by Example for a tutorial.
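As a rough illustration of the DOM and XPath-like interfaces mentioned in the description, the following sketch parses a small in-memory document and reads a value back out of it. The sample XML and the variable names are made up for this example; only load_xml, documentElement, findnodes, nodeName and textContent are taken from XML::LibXML itself.

```perl
use strict;
use warnings;
use XML::LibXML;

# Illustrative sample document (not part of the distribution).
my $dom = XML::LibXML->load_xml(string => <<'EOT');
<library><book id="b1"><title>Perl &amp; XML</title></book></library>
EOT

# The document element is the root of the DOM tree.
my $root      = $dom->documentElement;
my $root_name = $root->nodeName;

# findnodes() evaluates an XPath expression and returns the matching nodes.
my ($title)    = $root->findnodes('/library/book/title');
my $title_text = $title->textContent;

print "$root_name: $title_text\n";
```

Only `use XML::LibXML;` is needed here, even though the returned objects belong to several packages such as XML::LibXML::Document and XML::LibXML::Element.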
For further information, please check the following documentation:

    XML::LibXML::Parser - Parsing XML files with XML::LibXML
    XML::LibXML::DOM - XML::LibXML Document Object Model (DOM) Implementation
    XML::LibXML::SAX - XML::LibXML direct SAX parser
    XML::LibXML::Reader - Reading XML with a pull-parser
    XML::LibXML::Dtd - XML::LibXML frontend for DTD validation
    XML::LibXML::RelaxNG - XML::LibXML frontend for RelaxNG schema validation
    XML::LibXML::Schema - XML::LibXML frontend for W3C Schema schema validation
    XML::LibXML::XPathContext - API for evaluating XPath expressions with enhanced support for the evaluation context
    XML::LibXML::InputCallback - Implementing custom URI Resolver and input callbacks
    XML::LibXML::Common - Common functions for XML::LibXML related Classes

The nodes in the Document Object Model (DOM) are represented by the following classes (most of which "inherit" from XML::LibXML::Node):

    XML::LibXML::Document - XML::LibXML class for DOM document nodes
    XML::LibXML::Node - Abstract base class for XML::LibXML DOM nodes
    XML::LibXML::Element - XML::LibXML class for DOM element nodes
    XML::LibXML::Text - XML::LibXML class for DOM text nodes
    XML::LibXML::Comment - XML::LibXML class for comment DOM nodes
    XML::LibXML::CDATASection - XML::LibXML class for DOM CDATA sections
    XML::LibXML::Attr - XML::LibXML DOM attribute class
    XML::LibXML::DocumentFragment - XML::LibXML's DOM L2 Document Fragment implementation
    XML::LibXML::Namespace - XML::LibXML DOM namespace nodes
    XML::LibXML::PI - XML::LibXML DOM processing instruction nodes

Encodings support in XML::LibXML

Recall that since version 5.6.1, Perl distinguishes between character strings (internally encoded in UTF-8) and so-called binary data and, accordingly, applies either character or byte semantics to them. A scalar representing a character string is distinguished from a byte string by a special flag (UTF8). Please refer to perlunicode for details.

XML::LibXML's API is designed to deal with many encodings of XML documents completely transparently, so that the application using XML::LibXML can be completely ignorant of the encoding of the XML documents it works with. On the other hand, functions like XML::LibXML::Document->setEncoding give the user control over the document encoding.

To ensure the aforementioned transparency and uniformity, most functions of XML::LibXML that work with in-memory trees accept and return data as character strings (i.e.
UTF-8 encoded with the UTF8 flag on) regardless of the original document encoding; however, the functions related to I/O operations (i.e. parsing and saving) operate with binary data (in the original document encoding), obeying the encoding declaration of the XML documents.

Below we summarize basic rules and principles regarding encoding:

Do NOT apply any encoding-related PerlIO layers (:utf8 or :encoding(...)) to file handles that are an input for the parsers or an output for a serializer of (full) XML documents. This is because the conversion of the data to/from the internal character representation is provided by libxml2 itself, which must be able to enforce the encoding specified by the <?xml version="1.0" encoding="..."?> declaration. Here is an example to follow:

    use XML::LibXML;
    # load
    open my $fh, '<', 'file.xml' or die "Cannot open file.xml: $!";
    binmode $fh; # drop all PerlIO layers possibly created by a use open pragma
    my $doc = XML::LibXML->load_xml(IO => $fh);

    # save
    open my $out, '>', 'out.xml' or die "Cannot open out.xml: $!";
    binmode $out; # as above
    $doc->toFH($out);
    # or
    print {$out} $doc->toString();

All functions working with DOM accept and return character strings (UTF-8 encoded with the UTF8 flag on). E.g.

    my $doc = XML::LibXML::Document->new('1.0',$some_encoding);
    my $element = $doc->createElement($name);
    $element->appendText($text);
    $xml_fragment = $element->toString(); # returns a character string
    $xml_document = $doc->toString(); # returns a byte string

where $some_encoding is the document encoding that will be used when saving the document, and $name and $text contain character strings (UTF-8 encoded with the UTF8 flag on). Note that the method toString returns XML as a character string if applied to a node other than the Document node, and a byte string containing the appropriate <?xml version="1.0" encoding="..."?> declaration if applied to an XML::LibXML::Document.

DOM methods also accept binary strings in the original encoding of the document to which the node belongs (UTF-8 is assumed if the node is not attached to any document).
Exploiting this feature is NOT RECOMMENDED since it is considered bad practice.

    my $doc = XML::LibXML::Document->new('1.0','iso-8859-2');
    my $text = $doc->createTextNode($some_latin2_encoded_byte_string);
    # WORKS, BUT NOT RECOMMENDED!

NOTE: libxml2 support for many encodings is based on the iconv library. The actual list of supported encodings may vary from platform to platform. To test if your platform works correctly with your language encoding, build a simple document in the particular encoding and try to parse it with XML::LibXML to see if the parser produces any errors. Occasional crashes were reported on rare platforms that ship with a broken version of iconv.

Thread Support

XML::LibXML since 1.67 partially supports Perl threads in Perl >= 5.8.8. XML::LibXML can be used with threads in two ways:

By default, all XML::LibXML classes use the CLONE_SKIP class method to prevent Perl from copying XML::LibXML::* objects when a new thread is spawned. In this mode, all XML::LibXML::* objects are thread specific. This is the safest way to work with XML::LibXML in threads.

Alternatively, one may use

    use threads;
    use XML::LibXML qw(:threads_shared);

to indicate that all XML::LibXML node and parser objects should be shared between the main thread and any thread spawned from there. For example, in

    my $doc = XML::LibXML->load_xml(location => $filename);
    my $thr = threads->new(sub{
        # code working with $doc
        1;
    });
    $thr->join;

the variable $doc refers to the exact same XML::LibXML::Document in the spawned thread as in the main thread.

Without using mutex locks, parallel threads may read the same document (i.e. any node that belongs to the document), parse files, and modify different documents.

However, if there is a chance that some of the threads will attempt to modify a document (or even create new nodes based on that document, e.g.
with $doc->createElement) that other threads may be reading at the same time, the user is responsible for creating a mutex lock and using it both in the thread that modifies and in the thread that reads:

    my $doc = XML::LibXML->load_xml(location => $filename);
    my $mutex : shared;
    my $thr = threads->new(sub{
        lock $mutex;
        my $el = $doc->createElement('foo');
        # ...
        1;
    });
    {
        lock $mutex;
        my $root = $doc->documentElement;
        say $root->nodeName;
    }
    $thr->join;

Note that libxml2 uses dictionaries to store short strings and these dictionaries are kept on a document node. Without mutex locks, it could happen in the previous example that the thread modifies the dictionary while other threads attempt to read from it, which could easily lead to a crash.

Version Information

Sometimes it is useful to find out which version of libxml2 XML::LibXML was compiled for. In most cases this is for debugging or to check whether a given installation provides all the functionality the package needs. The functions XML::LibXML::LIBXML_DOTTED_VERSION and XML::LibXML::LIBXML_VERSION provide this version information. Both functions simply pass through the values of the similarly named macros of libxml2. Similarly, XML::LibXML::LIBXML_RUNTIME_VERSION returns the version of the (usually dynamically) linked libxml2.

XML::LibXML::LIBXML_DOTTED_VERSION

    $Version_String = XML::LibXML::LIBXML_DOTTED_VERSION;

Returns the version string of the libxml2 version XML::LibXML was compiled for. This will be "2.6.2" for "libxml2 2.6.2".

XML::LibXML::LIBXML_VERSION

    $Version_ID = XML::LibXML::LIBXML_VERSION;

Returns the version id of the libxml2 version XML::LibXML was compiled for. This will be "20602" for "libxml2 2.6.2". Don't confuse this version id with $XML::LibXML::VERSION. The latter contains the version of XML::LibXML itself, while the former contains the version of libxml2 XML::LibXML was compiled for.
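The version functions can be combined into a small diagnostic, which may be handy in bug reports. This is only a sketch: the variable names are illustrative, and the comparison assumes that the runtime version string begins with the numeric version id, possibly followed by a suffix.

```perl
use strict;
use warnings;
use XML::LibXML;

# Versions of libxml2 that XML::LibXML was compiled against.
my $compiled_dotted = XML::LibXML::LIBXML_DOTTED_VERSION();
my $compiled_id     = XML::LibXML::LIBXML_VERSION();

# Version of the libxml2 library actually linked at run-time.
my $runtime = XML::LibXML::LIBXML_RUNTIME_VERSION();

print "XML::LibXML itself:  $XML::LibXML::VERSION\n";
print "libxml2 (compiled):  $compiled_dotted ($compiled_id)\n";
print "libxml2 (run-time):  $runtime\n";

# Assumption: the runtime string starts with the numeric id; a trailing
# suffix (as in a development build) is ignored for the comparison.
my ($runtime_id) = $runtime =~ /^(\d+)/;
if (defined $runtime_id && $runtime_id < $compiled_id) {
    warn "Run-time libxml2 is older than the version compiled against\n";
}
```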
XML::LibXML::LIBXML_RUNTIME_VERSION

    $DLL_Version = XML::LibXML::LIBXML_RUNTIME_VERSION;

Returns a version string of the libxml2 which is (usually dynamically) linked by XML::LibXML. This will be "20602" for libxml2 released as "2.6.2" and something like "20602-CVS2032" for a CVS build of libxml2.

XML::LibXML issues a warning if the version of libxml2 dynamically linked to it is less than the version of libxml2 which it was compiled against.

EXPORTS

By default the module exports all constants and functions listed in the :all tag, described below.

EXPORT TAGS

:all

Includes the tags :libxml, :encoding, and :ns described below.

:libxml

Exports integer constants for DOM node types.

    XML_ELEMENT_NODE            => 1
    XML_ATTRIBUTE_NODE          => 2
    XML_TEXT_NODE               => 3
    XML_CDATA_SECTION_NODE      => 4
    XML_ENTITY_REF_NODE         => 5
    XML_ENTITY_NODE             => 6
    XML_PI_NODE                 => 7
    XML_COMMENT_NODE            => 8
    XML_DOCUMENT_NODE           => 9
    XML_DOCUMENT_TYPE_NODE      => 10
    XML_DOCUMENT_FRAG_NODE      => 11
    XML_NOTATION_NODE           => 12
    XML_HTML_DOCUMENT_NODE      => 13
    XML_DTD_NODE                => 14
    XML_ELEMENT_DECL            => 15
    XML_ATTRIBUTE_DECL          => 16
    XML_ENTITY_DECL             => 17
    XML_NAMESPACE_DECL          => 18
    XML_XINCLUDE_START          => 19
    XML_XINCLUDE_END            => 20

:encoding

Exports two encoding conversion functions from XML::LibXML::Common.

    encodeToUTF8()
    decodeFromUTF8()

:ns

Exports two convenience constants: the implicit namespace of the reserved xml: prefix, and the implicit namespace for the reserved xmlns: prefix.

    XML_XML_NS    => 'http://www.w3.org/XML/1998/namespace'
    XML_XMLNS_NS  => 'http://www.w3.org/2000/xmlns/'

Related Modules

The modules described in this section are not part of the XML::LibXML package itself. As they support some additional features, they are mentioned here.
    XML::LibXSLT - XSLT 1.0 Processor using libxslt and XML::LibXML
    XML::LibXML::Iterator - XML::LibXML Implementation of the DOM Traversal Specification
    XML::CompactTree::XS - Uses XML::LibXML::Reader to very efficiently parse an XML document or element into native Perl data structures, which are less flexible but significantly faster to process than DOM.

XML::LibXML and XML::GDOME

Note: THE FUNCTIONS DESCRIBED HERE ARE STILL EXPERIMENTAL

Although both modules make use of libxml2's XML capabilities, the DOM implementations of the two modules are not compatible. It is still possible, however, to exchange nodes from one DOM to the other. The concept of this exchange is pretty similar to the function cloneNode(): the particular node is copied at a low level to the opposite DOM implementation.

Since the DOM implementations cannot coexist within one document, one is forced to copy each node that should be used. Because you are always keeping two nodes, this may have quite an impact on a machine's memory usage.

XML::LibXML provides two functions to export or import GDOME nodes: import_GDOME() and export_GDOME(). Both functions have two parameters: the node and a flag for recursive import. The flag works as in cloneNode().

The two functions allow one to export and import XML::GDOME nodes explicitly; however, XML::LibXML also allows the transparent import of XML::GDOME nodes in functions such as appendChild(), insertAfter() and so on. While native nodes are automatically adopted in most functions, XML::GDOME nodes are always cloned in advance. Thus if the original node is modified after the operation, the node in the XML::LibXML document will not reflect the change.

import_GDOME

    $libxmlnode = XML::LibXML->import_GDOME( $node, $deep );

This clones an XML::GDOME node to an XML::LibXML node explicitly.

export_GDOME

    $gdomenode = XML::LibXML->export_GDOME( $node, $deep );

Allows one to clone an XML::LibXML node into an XML::GDOME node.
CONTACTS

For bug reports, please use the CPAN request tracker on http://rt.cpan.org/NoAuth/Bugs.html?Dist=XML-LibXML

For suggestions etc., and other issues related to XML::LibXML, you may use the perl XML mailing list (perl-xml@listserv.ActiveState.com), where most XML-related Perl modules are discussed. In case of problems you should check the archives of that list first. Many problems are already discussed there. You can find the list's archives and subscription options at http://aspn.activestate.com/ASPN/Mail/Browse/Threaded/perl-xml.

Parsing XML Data with XML::LibXML

XML::LibXML::Parser

Synopsis

    use XML::LibXML '1.70';

Parsing

An XML document is read into a data structure such as a DOM tree by a piece of software called a parser. XML::LibXML currently provides four different parser interfaces:

    A DOM Pull-Parser
    A DOM Push-Parser
    A SAX Parser
    A DOM based SAX Parser.

Creating a Parser Instance

XML::LibXML provides an OO interface to the libxml2 parser functions. Thus you have to create a parser instance before you can parse any XML data.

new

    # Parser constructor
    $parser = XML::LibXML->new();
    $parser = XML::LibXML->new(option=>value, ...);
    $parser = XML::LibXML->new({option=>value, ...});

Create a new XML and HTML parser instance. Each parser instance holds default values for various parser options. Optionally, one can pass a hash reference or a list of option => value pairs to set a different default set of options. Unless specified otherwise, the options load_ext_dtd and expand_entities are set to 1. See Parser Options for a list of libxml2 parser's options.

DOM Parser

One of the common parser interfaces of XML::LibXML is the DOM parser. This parser reads XML data into a DOM-like data structure, so each tag can be accessed and transformed.

XML::LibXML's DOM parser is capable of parsing not only XML data, but also (strict) HTML files. There are three ways to parse documents - as a string, as a Perl filehandle, or as a filename/URL.
The return value from each is an XML::LibXML::Document object, which is a DOM object.

All of the functions listed below will throw an exception if the document is invalid. To prevent this from terminating your program, wrap the call in an eval{} block.

load_xml

    # Parsing XML
    $dom = XML::LibXML->load_xml(
        location => $file_or_url
        # parser options ...
      );
    $dom = XML::LibXML->load_xml(
        string => $xml_string
        # parser options ...
      );
    $dom = XML::LibXML->load_xml(
        string => (\$xml_string)
        # parser options ...
      );
    $dom = XML::LibXML->load_xml({
        IO => $perl_file_handle
        # parser options ...
      });
    $dom = $parser->load_xml(...);

This function is available since XML::LibXML 1.70. It provides an easy-to-use interface to the XML parser that parses a given file (or non-HTTPS URL), string, or input stream into a DOM tree. The arguments can be passed in a HASH reference or as name => value pairs. The function can be called as a class method or an object method. In both cases it internally creates a new parser instance, passing the specified parser options; if called as an object method, it clones the original parser (preserving its settings) and additionally applies the specified options to the new parser. See the constructor new and Parser Options for more information.

Note that, due to a limitation in the underlying libxml2 library, this call does not recognize HTTPS-based URLs. (It will treat an HTTPS URL as a filename, likely throwing a "No such file or directory" exception.)

load_html

    # Parsing HTML
    $dom = XML::LibXML->load_html(...);
    $dom = $parser->load_html(...);

This function is available since XML::LibXML 1.70. It has the same usage as load_xml, providing an interface to the HTML parser. See load_xml for more information.

Parsing HTML may cause problems, especially if the ampersand ('&') is used. This is a common problem if HTML code is parsed that contains links to CGI-scripts. Such links cause the parser to throw errors.
In such cases libxml2 still parses the entire document as if there were no error, but the error causes XML::LibXML to stop the parsing process. However, the document is not lost. Such HTML documents should be parsed using the recover flag. By default, recovery is deactivated.

The functions described above are implemented to parse well-formed documents. In some cases a program gets well-balanced XML instead of well-formed documents (e.g. an XML fragment from a database). With XML::LibXML it is not necessary to wrap such fragments in the code, because XML::LibXML is capable of parsing even well-balanced XML fragments.

parse_balanced_chunk

    # Parsing well-balanced XML chunks
    $fragment = $parser->parse_balanced_chunk( $wbxmlstring, $encoding );

This function parses a well-balanced XML string into an XML::LibXML::DocumentFragment. The first argument contains the input string; the optional second argument can be used to specify the character encoding of the input (UTF-8 is assumed by default).

parse_xml_chunk

This is the old name of parse_balanced_chunk(). Because it may cause confusion with the push parser interface, this function should not be used anymore.

By default XML::LibXML does not process XInclude tags within an XML document (see the options section below). XML::LibXML allows one to post-process a document to expand XInclude tags.

process_xincludes

    # Processing XInclude
    $parser->process_xincludes( $doc );

After a document is parsed into a DOM structure, you may want to expand the document's XInclude tags. This function processes the given document structure and expands all XInclude tags (or throws an error) by using the flags and callbacks of the given parser instance.

Note that the resulting tree contains some extra nodes (of type XML_XINCLUDE_START and XML_XINCLUDE_END) after successfully processing the document. These nodes indicate where data was included into the original tree. If the document is serialized, these extra nodes will not show up.
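A minimal sketch of XInclude post-processing is shown here. The file names, the temporary directory and the write_file helper are hypothetical and exist only to make the example self-contained; the serialized result may carry extra attributes such as xml:base on the included element, depending on the libxml2 version.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);
use XML::LibXML;

# Hypothetical helper: write $content to $path as raw bytes.
sub write_file {
    my ($path, $content) = @_;
    open my $fh, '>', $path or die "Cannot write $path: $!";
    binmode $fh;
    print {$fh} $content;
    close $fh or die "Cannot close $path: $!";
}

# Two illustrative files in a scratch directory: main.xml includes part.xml.
my $dir  = tempdir(CLEANUP => 1);
my $main = File::Spec->catfile($dir, 'main.xml');
write_file(File::Spec->catfile($dir, 'part.xml'),
    '<part>included text</part>');
write_file($main,
    '<root xmlns:xi="http://www.w3.org/2001/XInclude">'
  . '<xi:include href="part.xml"/></root>');

# XIncludes are not expanded while parsing by default ...
my $parser = XML::LibXML->new;
my $doc    = $parser->parse_file($main);

# ... so they are expanded afterwards, in place.
$parser->process_xincludes($doc);

my $xml = $doc->toString;
print $xml;
```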
Remember: a document with processed XIncludes differs from the original document after serialization, because the original XInclude tags will not be restored!

If the parser flag "expand_xincludes" is set to 1, you do not need to post-process the parsed document.

processXIncludes

    $parser->processXIncludes( $doc );

This is an alias to process_xincludes, with a Java-like function name.

parse_file

    # Old-style parser interfaces
    $doc = $parser->parse_file( $xmlfilename );

This function parses an XML document from a file or network; $xmlfilename can be either a filename or a (non-HTTPS) URL. Note that for parsing files, this function is the fastest choice, about 6-8 times faster than parse_fh().

parse_fh

    $doc = $parser->parse_fh( $io_fh );

parse_fh() parses an IOREF or a subclass of IO::Handle.

Because the data comes from an open handle, libxml2's parser does not know about the base URI of the document. To set the base URI one should use parse_fh() as follows:

    my $doc = $parser->parse_fh( $io_fh, $baseuri );

parse_string

    $doc = $parser->parse_string( $xmlstring);

This function is similar to parse_fh(), but it parses an XML document that is available as a single string in memory, or alternatively as a reference to a scalar containing a string. Again, you can pass an optional base URI to the function.

    my $doc = $parser->parse_string( $xmlstring, $baseuri );
    my $doc = $parser->parse_string(\$xmlstring, $baseuri);

parse_html_file

    $doc = $parser->parse_html_file( $htmlfile, \%opts );

Similar to parse_file() but parses HTML (strict) documents; $htmlfile can be a filename or a (non-HTTPS) URL.

An optional second argument can be used to pass some options to the HTML parser as a HASH reference. See options labeled with HTML in Parser Options.

parse_html_fh

    $doc = $parser->parse_html_fh( $io_fh, \%opts );

Similar to parse_fh() but parses HTML (strict) streams.

An optional second argument can be used to pass some options to the HTML parser as a HASH reference. See options labeled with HTML in Parser Options.
Note: the encoding option may not work correctly with this function in libxml2 < 2.6.27 if the HTML file declares its charset using a META tag.

parse_html_string

    $doc = $parser->parse_html_string( $htmlstring, \%opts );

Similar to parse_string() but parses HTML (strict) strings.

An optional second argument can be used to pass some options to the HTML parser as a HASH reference. See options labeled with HTML in Parser Options.

Push Parser

XML::LibXML provides a push parser interface. Rather than pulling the data from a given source, the push parser waits for the data to be pushed into it.

This allows one to parse large documents without waiting for the parser to finish. The interface is especially useful if a program needs to pre-process the incoming pieces of XML (e.g. to detect document boundaries).

While the XML::LibXML parse_*() functions require the data to be well-formed XML, the push parser will take any arbitrary string that contains some XML data. The only requirement is that all the pushed strings together form a well-formed document. With the push parser interface a program can interrupt the parsing process as required, where the parse_*() functions do not give enough flexibility.

Unlike the pull parser implemented in parse_fh() or parse_file(), the push parser is not able to find out about the document's end itself. Thus the calling program needs to indicate explicitly when the parsing is done.

In XML::LibXML this is done by a single function:

parse_chunk

    # Push parser
    $parser->parse_chunk($string, $terminate);

parse_chunk() tries to parse a given chunk of data, which isn't necessarily well-balanced data. The function takes two parameters: the chunk of data as a string and, optionally, a termination flag. If the termination flag is set to a true value (e.g.
1), the parsing will be stopped and the resulting document will be returned, as the following example shows:

    my $parser = XML::LibXML->new;
    for my $string ( "<", "foo", ' bar="hello world"', "/>") {
        $parser->parse_chunk( $string );
    }
    my $doc = $parser->parse_chunk("", 1); # terminate the parsing

Internally XML::LibXML provides three functions that control the push parser process:

init_push

    $parser->init_push();

Initializes the push parser.

push

    $parser->push(@data);

This function pushes the data stored inside the array to libxml2's parser. Each entry in @data must be a normal scalar! This method can be called repeatedly.

finish_push

    $doc = $parser->finish_push( $recover );

This function returns the result of the parsing process. If this function is called without a parameter it will complain about non-well-formed documents. If $recover is 1, the push parser can be used to restore broken or non-well-formed (XML) documents. Without it, the following code reports an error:

    eval {
        $parser->push( "<foo>", "bar" );
        $doc = $parser->finish_push();    # will report broken XML
    };
    if ( $@ ) {
        # ...
    }

This can be annoying if the closing tag is missed by accident. The following code will restore the document:

    eval {
        $parser->push( "<foo>", "bar" );
        $doc = $parser->finish_push(1);   # will return the data parsed
                                          # unless an error happened
    };
    print $doc->toString(); # returns "<foo>bar</foo>"

Of course finish_push() will return nothing if no data was pushed to the parser before.

Pull Parser (Reader)

XML::LibXML also provides a pull-parser interface similar to the XmlReader interface in .NET. This interface is almost streaming, and is usually faster and simpler to use than SAX. See XML::LibXML::Reader.

Direct SAX Parser

XML::LibXML provides a direct SAX parser in the XML::LibXML::SAX module.

DOM based SAX Parser

XML::LibXML also provides a DOM based SAX parser. The SAX parser is defined in the module XML::LibXML::SAX::Parser. As it is not a stream based parser, it parses documents into a DOM and traverses the DOM tree instead.
The API of this parser is exactly the same as any other Perl SAX2 parser. See XML::SAX::Intro for details. Aside from the regular parsing methods, you can access the DOM tree traverser directly, using the generate() method: my $doc = build_yourself_a_document(); my $saxparser = XML::LibXML::SAX::Parser->new( ... ); $saxparser->generate( $doc ); This is useful for serializing DOM trees, for example trees that you have already processed or that are the result of XSLT processing. WARNING This is NOT a streaming SAX parser. As I said above, this parser reads the entire document into a DOM and serialises it. Some people couldn't read that in the paragraph above so I've added this warning. If you want a streaming SAX parser look at the XML::LibXML::SAX man page. Serialization XML::LibXML provides some functions to serialize nodes and documents. The serialization functions are described on the XML::LibXML::Node manpage or the XML::LibXML::Document manpage. XML::LibXML checks three global flags that alter the serialization process: skipXMLDeclaration skipDTD setTagCompression Of these three flags, only setTagCompression is available for all serialization functions. Because XML::LibXML does not define these flags itself, one has to define them locally, as the following example shows: local $XML::LibXML::skipXMLDeclaration = 1; local $XML::LibXML::skipDTD = 1; local $XML::LibXML::setTagCompression = 1; If skipXMLDeclaration is defined and not '0', the XML declaration is omitted during serialization. If skipDTD is defined and not '0', an existing DTD will not be serialized with the document. If setTagCompression is defined and not '0', empty tags are displayed as open and closing tags rather than the shortcut. For example the empty tag foo will be rendered as <foo></foo> rather than <foo/>. Parser Options Handling of libxml2 parser options has been unified and improved in XML::LibXML 1.70.
You can now set default options for a particular parser instance by passing them to the constructor as XML::LibXML->new({name=>value, ...}) or XML::LibXML->new(name=>value,...). The options can be queried and changed using the following methods (pre-1.70 interfaces such as $parser->load_ext_dtd(0) also exist, see below): option_exists # Set/query parser options $parser->option_exists($name); Returns 1 if the current XML::LibXML version supports the option $name, otherwise returns 0 (note that this does not necessarily mean that the option is supported by the underlying libxml2 library). get_option $parser->get_option($name); Returns the current value of the parser option $name. set_option $parser->set_option($name,$value); Sets option $name to value $value. set_options $parser->set_options({$name=>$value,...}); Sets multiple parsing options at once. IMPORTANT NOTE: This documentation reflects the parser flags available in libxml2 2.7.3. Some options have no effect if an older version of libxml2 is used. Each of the flags listed below is labeled /parser/ if it can be used with a XML::LibXML parser object (i.e. passed to XML::LibXML->new, XML::LibXML->set_option, etc.); /html/ if it can be passed to the parse_html_* methods; /reader/ if it can be used with the XML::LibXML::Reader. Unless specified otherwise, the default for boolean valued options is 0 (false). The available options are: URI /parser, html, reader/ In case of parsing strings or file handles, XML::LibXML doesn't know about the base URI of the document. To make relative references such as XIncludes work, one has to set a base URI, which is then used for the parsed document. line_numbers /parser, html, reader/ If this option is activated, libxml2 will store the line number of each element node in the parsed document. The line number can be obtained using the line_number() method of the XML::LibXML::Node class (for non-element nodes this may report the line number of the containing element).
The line numbers are also used for reporting positions of validation errors. IMPORTANT: Due to limitations in the libxml2 library line numbers greater than 65535 will be returned as 65535. Unfortunately, this is a long and sad story, please see http://bugzilla.gnome.org/show_bug.cgi?id=325533 for more details. encoding /html/ character encoding of the input recover /parser, html, reader/ recover from errors; possible values are 0, 1, and 2 A true value turns on recovery mode which allows one to parse broken XML or HTML data. The recovery mode allows the parser to return the successfully parsed portion of the input document. This is useful for almost well-formed documents, where for example a closing tag is missing somewhere. Still, XML::LibXML will only parse until the first fatal (non-recoverable) error occurs, reporting recoverable parsing errors as warnings. To suppress even these warnings, use recover=>2. Note that validation is switched off automatically in recovery mode. expand_entities /parser, reader/ substitute entities; possible values are 0 and 1; default is 1 Note that although this flag disables entity substitution, it does not prevent the parser from loading external entities; when substitution of an external entity is disabled, the entity will be represented in the document tree by an XML_ENTITY_REF_NODE node whose subtree will be the content obtained by parsing the external resource; Although this nesting is visible from the DOM it is transparent to XPath data model, so it is possible to match nodes in an unexpanded entity by the same XPath expression as if the entity were expanded. See also ext_ent_handler. ext_ent_handler /parser/ Provide a custom external entity handler to be used when expand_entities is set to 1. Possible value is a subroutine reference. This feature does not work properly in libxml2 < 2.6.27! The subroutine provided is called whenever the parser needs to retrieve the content of an external entity. 
It is called with two arguments: the system ID (URI) and the public ID. The value returned by the subroutine is parsed as the content of the entity. This method can be used to completely disable entity loading, e.g. to prevent exploits in which a service is tricked to expose its private data by letting it parse a remote file (RSS feed) that contains an entity reference to a local file (e.g. /etc/fstab). A more granular solution to this problem, however, is provided by custom URL resolvers, as in my $c = XML::LibXML::InputCallback->new(); sub match { # accept file:/ URIs except for XML catalogs in /etc/xml/ my ($uri) = @_; return ($uri=~m{^file:/} and $uri !~ m{^file:///etc/xml/}) ? 1 : 0; } $c->register_callbacks([ \&match, sub{}, sub{}, sub{} ]); $parser->input_callbacks($c); load_ext_dtd /parser, reader/ load the external DTD subset while parsing; possible values are 0 and 1. Unless specified, XML::LibXML sets this option to 1. This flag is also required for DTD validation, for completing attributes, and for expanding entities, regardless of whether the document has an internal subset. Thus switching off external DTD loading will disable entity expansion, validation, and attribute completion on internal subsets as well.
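As a rough sketch of how the constructor options and the get_option(), set_option() and set_options() methods described above fit together, the following snippet configures a parser that refuses network access and external DTD loading. The sample document and the variable names are illustrative assumptions only; which options are appropriate depends on the application.

```perl
use strict;
use warnings;
use XML::LibXML;

# Options passed to the constructor become defaults for this parser instance.
my $parser = XML::LibXML->new(
    no_network   => 1,    # refuse to fetch non-local resources
    load_ext_dtd => 0,    # do not load the external DTD subset
    line_numbers => 1,    # record line numbers on element nodes
);

# Options can be queried and changed after construction.
print "no_network is on\n" if $parser->get_option('no_network');
$parser->set_option( 'recover', 0 );
$parser->set_options( { no_blanks => 1 } );

# Illustrative input; any well-formed string would do.
my $doc = $parser->parse_string("<root>\n  <item/>\n</root>");
my ($item) = $doc->documentElement->getChildrenByTagName('item');
print "item found on line ", $item->line_number, "\n";
```

Whether an option actually takes effect also depends on the libxml2 version in use, as noted above, so option_exists() alone is not a guarantee.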
complete_attributes /parser, reader/ create default DTD attributes; possible values are 0 and 1 validation /parser, reader/ validate with the DTD; possible values are 0 and 1 suppress_errors /parser, html, reader/ suppress error reports; possible values are 0 and 1 suppress_warnings /parser, html, reader/ suppress warning reports; possible values are 0 and 1 pedantic_parser /parser, html, reader/ pedantic error reporting; possible values are 0 and 1 no_blanks /parser, html, reader/ remove blank nodes; possible values are 0 and 1 no_defdtd /html/ do not add a default DOCTYPE; possible values are 0 and 1. The default (0) is to add a DTD when the input HTML lacks one. expand_xinclude or xinclude /parser, reader/ Implement XInclude substitution; possible values are 0 and 1 Expands XInclude tags immediately while parsing the document. Note that the parser will use the URI resolvers installed via XML::LibXML::InputCallback to parse the included document (if any). no_xinclude_nodes /parser, reader/ do not generate XINCLUDE START/END nodes; possible values are 0 and 1 no_network /parser, html, reader/ Forbid network access; possible values are 0 and 1 If set to true, all attempts to fetch non-local resources (such as DTD or external entities) will fail (unless custom callbacks are defined). It may be necessary to use the flag recover for processing documents requiring such resources while networking is off. clean_namespaces /parser, reader/ remove redundant namespace declarations during parsing; possible values are 0 and 1. no_cdata /parser, html, reader/ merge CDATA as text nodes; possible values are 0 and 1 no_basefix /parser, reader/ do not fix up XInclude xml:base URIs; possible values are 0 and 1 huge /parser, html, reader/ relax any hardcoded limit from the parser; possible values are 0 and 1. Unless specified, XML::LibXML sets this option to 0. Note: the default value for this option was changed to protect against denial of service through entity expansion attacks.
Before enabling the option ensure you have taken alternative measures to protect your application against this type of attack. gdome /parser/ THIS OPTION IS EXPERIMENTAL! Although quite powerful, XML::LibXML's DOM implementation is incomplete with respect to the DOM level 2 or level 3 specifications. XML::GDOME is based on libxml2 as well, and provides a rather complete DOM implementation by wrapping libgdome. This flag allows you to make use of XML::LibXML's full parser options and XML::GDOME's DOM implementation at the same time. To make use of this function, one has to install libgdome and configure XML::LibXML to use this library. For this you need to rebuild XML::LibXML! Note: this feature was not seriously tested in recent XML::LibXML releases. For compatibility with XML::LibXML versions prior to 1.70, the following methods are also supported for querying and setting the corresponding parser options (if called without arguments, the methods return the current value of the corresponding parser options; with an argument sets the option to a given value): $parser->validation(); $parser->recover(); $parser->pedantic_parser(); $parser->line_numbers(); $parser->load_ext_dtd(); $parser->complete_attributes(); $parser->expand_xinclude(); $parser->gdome_dom(); $parser->clean_namespaces(); $parser->no_network(); The following obsolete methods trigger parser options in some special way: recover_silently $parser->recover_silently(1); If called without an argument, returns true if the current value of the recover parser option is 2 and returns false otherwise. With a true argument sets the recover parser option to 2; with a false argument sets the recover parser option to 0. expand_entities $parser->expand_entities(0); Get/set the expand_entities option. If called with a true argument, also turns the load_ext_dtd option to 1. keep_blanks $parser->keep_blanks(0); This is actually the opposite of the no_blanks parser option. 
If used without an argument retrieves negated value of no_blanks. If used with an argument sets no_blanks to the opposite value. base_uri $parser->base_uri( $your_base_uri ); Get/set the URI option. XML Catalogs libxml2 supports XML catalogs. Catalogs are used to map remote resources to their local copies. Using catalogs can speed up parsing processes if many external resources from remote addresses are loaded into the parsed documents (such as DTDs or XIncludes). Note that libxml2 has a global pool of loaded catalogs, so if you apply the method load_catalog to one parser instance, all parser instances will start using the catalog (in addition to other previously loaded catalogs). Note also that catalogs are not used when a custom external entity handler is specified. At the current state it is not possible to make use of both types of resolving systems at the same time. load_catalog # XML catalogs $parser->load_catalog( $catalog_file ); Loads the XML catalog file $catalog_file. # Global external entity loader (similar to ext_ent_handler option # but this works really globally, also in XML::LibXSLT include etc..) XML::LibXML::externalEntityLoader(\&my_loader); Error Reporting XML::LibXML throws exceptions during parsing, validation or XPath processing (and some other occasions). These errors can be caught by using eval blocks. The error is stored in $@. There are two implementations: the old one throws $@ which is just a message string, in the new one $@ is an object from the class XML::LibXML::Error; this class overrides the operator "" so that when printed, the object flattens to the usual error message. XML::LibXML throws errors as they occur. This is a very common misunderstanding in the use of XML::LibXML. If the eval is omitted, XML::LibXML will always halt your script by "croaking" (see Carp man page for details). Also note that an increasing number of functions throw errors if bad data is passed as arguments. 
If you cannot assure valid data passed to XML::LibXML you should eval these functions. Note: since version 1.59, get_last_error() is no longer available in XML::LibXML for thread-safety reasons. XML::LibXML direct SAX parser XML::LibXML::SAX Description XML::LibXML provides an interface to libxml2 direct SAX interface. Through this interface it is possible to generate SAX events directly while parsing a document. While using the SAX parser XML::LibXML will not create a DOM Document tree. Such an interface is useful if very large XML documents have to be processed and no DOM functions are required. By using this interface it is possible to read data stored within an XML document directly into the application data structures without loading the document into memory. The SAX interface of XML::LibXML is based on the famous XML::SAX interface. It uses the generic interface as provided by XML::SAX::Base. Additionally to the generic functions, which are only able to process entire documents, XML::LibXML::SAX provides parse_chunk(). This method generates SAX events from well balanced data such as is often provided by databases. Features NOTE: This feature is experimental. You can enable character data joining which may yield a significant speed boost in your XML processing in lower markup ratio situations by enabling the http://xmlns.perl.org/sax/join-character-data feature of this parser. This is done via the set_feature method like this: $p->set_feature('http://xmlns.perl.org/sax/join-character-data', 1); You can also specify a 0 to disable. The default is to have this feature disabled. Building DOM trees from SAX events. XML::LibXML::SAX::Builder Synopsis use XML::LibXML::SAX::Builder; my $builder = XML::LibXML::SAX::Builder->new(); my $gen = XML::Generator::DBI->new(Handler => $builder, dbh => $dbh); $gen->execute("SELECT * FROM Users"); my $doc = $builder->result(); Description This is a SAX handler that generates a DOM tree from SAX events. Usage is as above. 
Input is accepted from any SAX1 or SAX2 event generator. Building DOM trees from SAX events is quite easy with XML::LibXML::SAX::Builder. The class is designed as a SAX2 final handler, not as a filter! Since SAX is strictly stream oriented, you should not expect anything to return from a generator. Instead you have to ask the builder instance directly to get the document built. XML::LibXML::SAX::Builder's result() function holds the document generated from the last SAX stream. XML::LibXML DOM Implementation XML::LibXML::DOM Description XML::LibXML provides a lightweight interface to modify a node of the document tree generated by the XML::LibXML parser. This interface follows as far as possible the DOM Level 3 specification. In addition to the specified functions, XML::LibXML supports some functions that are more handy to use in the Perl environment. One also has to remember that XML::LibXML is an interface to libxml2 nodes which actually reside on the C-Level of XML::LibXML. This means each node is a reference to a structure which is different from a Perl hash or array. The only way to access these structures' values is through the DOM interface provided by XML::LibXML. This also means that one can't simply inherit an XML::LibXML node and add new member variables as if they were hash keys. The DOM interface of XML::LibXML does not intend to implement a full DOM interface as it is done by XML::GDOME and used for full featured applications. Rather, it offers a simple way to build or modify documents that are created by XML::LibXML's parser. Another target of the XML::LibXML interface is to make the interfaces of libxml2 available to the Perl community. This also includes workarounds for some features where libxml2 assumes more control over the C-Level than most Perl users have. One of the most important parts of the XML::LibXML DOM interface is that the interfaces try to follow the DOM Level 3 specification rather strictly.
This means the interface functions are named as the DOM specification says and not what widespread Java interfaces claim to be the standard. For several functions that have only a single interface conforming to the DOM spec, XML::LibXML provides an additional Java style alias. Moreover, there are some function interfaces left over from early stages of XML::LibXML; these are kept for compatibility reasons only. They might disappear in one of the future versions of XML::LibXML, so a user is requested to switch over to the official functions. Encodings and XML::LibXML's DOM implementation See the section on Encodings in the XML::LibXML manual page. Namespaces and XML::LibXML's DOM implementation XML::LibXML's DOM implementation is limited by the DOM implementation of libxml2 which treats namespaces slightly differently than required by the DOM Level 2 specification. According to the DOM Level 2 specification, namespaces of elements and attributes should be persistent, and nodes should be permanently bound to namespace URIs as they get created; it should be possible to manipulate the special attributes used for declaring XML namespaces just as other attributes without affecting the namespaces of other nodes. In DOM Level 2, the application is responsible for creating the special attributes consistently and/or for correct serialization of the document. This is inconvenient, causes problems in serialization of DOM to XML, and most importantly, seems almost impossible to implement over libxml2. In libxml2, the namespace URI and prefix of a node are provided by a pointer to a namespace declaration (appearing as a special xmlns attribute in the XML document). If the prefix or namespace URI of the declaration changes, the prefix and namespace URI of all nodes that point to it change as well.
Moreover, in contrast to DOM, a node (element or attribute) can only be bound to a namespace URI if there is some namespace declaration in the document to point to. Therefore the current DOM implementation in XML::LibXML tries to treat namespace declarations in a compromise between reason, common sense, limitations of libxml2, and the DOM Level 2 specification. In XML::LibXML, special attributes declaring XML namespaces are often created automatically, usually when a namespaced node is attached to a document and no existing declaration of the namespace and prefix is in the scope to be reused. In this respect, the XML::LibXML DOM implementation differs from the DOM Level 2 specification, according to which special attributes for declaring the appropriate XML namespaces should not be added when a node with a namespace prefix and namespace URI is created. Namespace declarations are also created when XML::LibXML::Document's createElementNS() or createAttributeNS() functions are used. If the namespace is not declared on the documentElement, the namespace will be locally declared for the newly created node. In case of Attributes this may look a bit confusing, since these nodes cannot have namespace declarations themselves. In this case the namespace is internally applied to the attribute and later declared on the node the attribute is appended to (if required). The following example may explain this a bit: my $doc = XML::LibXML->createDocument; my $root = $doc->createElementNS( "", "foo" ); $doc->setDocumentElement( $root ); my $attr = $doc->createAttributeNS( "bar", "bar:foo", "test" ); $root->setAttributeNodeNS( $attr ); This piece of code will result in the following document: <?xml version="1.0"?> <foo xmlns:bar="bar" bar:foo="test"/> The namespace is declared on the document element during the setAttributeNodeNS() call. Namespaces can also be declared explicitly by the use of XML::LibXML::Element's setNamespace() function.
Since 1.61, they can also be manipulated with functions setNamespaceDeclPrefix() and setNamespaceDeclURI() (not available in DOM). Changing a URI or prefix of an existing namespace declaration affects the namespace URI and prefix of all nodes which point to it (that is the nodes in its scope). It is also important to repeat the specification: While working with namespaces you should use the namespace aware functions instead of the simplified versions. For example you should never use setAttribute() but setAttributeNS(). XML::LibXML DOM Document Class XML::LibXML::Document Synopsis use XML::LibXML; # Only methods specific to Document nodes are listed here, # see the XML::LibXML::Node manpage for other methods Description The Document Class is in most cases the result of a parsing process. But sometimes it is necessary to create a Document from scratch. The DOM Document Class provides functions that conform to the DOM Core naming style. It inherits all functions from XML::LibXML::Node as specified in the DOM specification. This enables access to the nodes besides the root element on document level - a DTD for example. The support for these nodes is limited at the moment. While generally nodes are bound to a document in the DOM concept, it is suggested that one should always create a node not bound to any document. There is no need to actually include the node in the document, but once the node is bound to a document, it is quite safe to assume that all strings have the correct encoding. If an unbound text node with an ISO encoded string is created (e.g. with $CLASS->new()), the toString function may not return the expected result. To prevent such problems, it is recommended to pass all data to XML::LibXML methods as character strings (i.e. UTF-8 encoded, with the UTF8 flag on). Methods Many functions listed here are extensively documented in the DOM Level 3 specification. Please refer to the specification for extensive documentation.
new $dom = XML::LibXML::Document->new( $version, $encoding ); alias for createDocument() createDocument $dom = XML::LibXML::Document->createDocument( $version, $encoding ); The constructor for the document class. As parameters it takes the version string and (optionally) the encoding string. Simply calling createDocument() will create the document: <?xml version="your version" encoding="your encoding"?> Both parameters are optional. The default value for $version is 1.0, of course. If the $encoding parameter is not set, the encoding will be left unset, which means UTF-8 is implied. Calling createDocument() without any parameter will result in the following code: <?xml version="1.0"?> Alternatively one can call this constructor directly from the XML::LibXML class level, to avoid some typing. This will not have any effect on the class instance, which is always XML::LibXML::Document. my $document = XML::LibXML->createDocument( "1.0", "UTF-8" ); is therefore a shortcut for my $document = XML::LibXML::Document->createDocument( "1.0", "UTF-8" ); URI $strURI = $doc->URI(); Returns the URI (or filename) of the original document. For documents obtained by parsing a string or a FH without using the URI parsing argument of the corresponding parse_* function, the result is a generated string unknown-XYZ where XYZ is some number; for documents created with the constructor new, the URI is undefined. The value can be modified by calling the setURI method on the document node. setURI $doc->setURI($strURI); Sets the URI of the document reported by the method URI (see also the URI argument to the various parse_* functions). encoding $strEncoding = $doc->encoding(); returns the encoding string of the document. my $doc = XML::LibXML->createDocument( "1.0", "ISO-8859-15" ); print $doc->encoding; # prints ISO-8859-15 actualEncoding $strEncoding = $doc->actualEncoding(); returns the encoding in which the XML will be returned by $doc->toString().
This is usually the original encoding of the document as declared in the XML declaration and returned by $doc->encoding. If the original encoding is not known (e.g. if created in memory or parsed from XML without a declared encoding), 'UTF-8' is returned. my $doc = XML::LibXML->createDocument( "1.0", "ISO-8859-15" ); print $doc->actualEncoding; # prints ISO-8859-15 setEncoding $doc->setEncoding($new_encoding); This method allows one to change the declaration of encoding in the XML declaration of the document. The value also affects the encoding in which the document is serialized to XML by $doc->toString(). Use setEncoding() to remove the encoding declaration. version $strVersion = $doc->version(); returns the version string of the document. getVersion() is an alternative form of this function. standalone $doc->standalone This function returns the numerical value of the standalone attribute of a document's XML declaration. It returns 1 if standalone="yes" was found, 0 if standalone="no" was found and -1 if standalone was not specified (default on creation). setStandalone $doc->setStandalone($numvalue); Through this method it is possible to alter the value of a document's standalone attribute. Set it to 1 to set standalone="yes", to 0 to set standalone="no" or set it to -1 to remove the standalone attribute from the XML declaration. compression my $compression = $doc->compression; libxml2 allows reading of documents directly from gzipped files. In this case the compression variable is set to the compression level of that file (0-8). If XML::LibXML parsed a different source or the file wasn't compressed, the returned value will be -1. setCompression $doc->setCompression($ziplevel); If one intends to write the document directly to a file, it is possible to set the compression level for a given document. This level can be in the range from 0 to 8. If XML::LibXML should not try to compress use -1 (default).
Note that this feature will only work if libxml2 is compiled with zlib support and toFile() is used for output. toString $docstring = $dom->toString($format); toString is a DOM serializing function, so the DOM Tree is serialized into an XML string, ready for output. IMPORTANT: unlike toString for other nodes, on document nodes this function returns the XML as a byte string in the original encoding of the document (see the actualEncoding() method)! This means you can simply do: open my $out_fh, '>', $file; print {$out_fh} $doc->toString; regardless of the actual encoding of the document. See the section on encodings in XML::LibXML for more details. The optional $format parameter sets the indenting of the output. This parameter is expected to be an integer value that specifies whether indentation should be used. The format parameter can have three different values if it is used: If $format is 0, then the document is dumped as it was originally parsed. If $format is 1, libxml2 will add ignorable white spaces, so the nodes' content is easier to read. Existing text nodes will not be altered. If $format is 2 (or higher), libxml2 will act as with $format == 1 but it adds a leading and a trailing line break to each text node. libxml2 uses a hard-coded indentation of 2 space characters per indentation level. This value cannot be altered at run-time. toStringC14N $c14nstr = $doc->toStringC14N($comment_flag, $xpath [, $xpath_context ]); See the documentation in XML::LibXML::Node. toStringEC14N $ec14nstr = $doc->toStringEC14N($comment_flag, $xpath [, $xpath_context ], $inclusive_prefix_list); See the documentation in XML::LibXML::Node. serialize $str = $doc->serialize($format); An alias for toString(). This function name was added to be more consistent with libxml2. serialize_c14n An alias for toStringC14N(). serialize_exc_c14n An alias for toStringEC14N(). toFile $state = $doc->toFile($filename, $format); This function is similar to toString(), but it writes the document directly to a file.
This function is very useful if one needs to store large documents. The format parameter has the same behaviour as in toString(). toFH $state = $doc->toFH($fh, $format); This function is similar to toString(), but it writes the document directly to a filehandle or a stream. A byte stream in the document encoding is passed to the file handle. Do NOT apply any :encoding(...) or :utf8 PerlIO layer to the filehandle! See the section on encodings in XML::LibXML for more details. The format parameter has the same behaviour as in toString(). toStringHTML $str = $document->toStringHTML(); toStringHTML serializes the tree to a byte string in the document encoding as HTML. With this method indenting is automatic and managed by libxml2 internally. serialize_html $str = $document->serialize_html(); An alias for toStringHTML(). is_valid $bool = $dom->is_valid(); Returns either TRUE or FALSE depending on whether the DOM Tree is a valid Document or not. You may also pass in a XML::LibXML::Dtd object, to validate against an external DTD: if (!$dom->is_valid($dtd)) { warn("document is not valid!"); } validate $dom->validate(); This is an exception throwing equivalent of is_valid. If the document is not valid it will throw an exception containing the error. This allows much better error reporting than a plain is_valid result. Again, you may pass in a DTD object. documentElement $root = $dom->documentElement(); Returns the root element of the Document. A document can have just one root element to contain the document's data. Optionally one can use getDocumentElement. setDocumentElement $dom->setDocumentElement( $root ); This function enables you to set the root element for a document. The function supports the import of a node from a different document tree, but does not support a document fragment as $root. createElement $element = $dom->createElement( $nodename ); This function creates a new Element Node bound to the DOM with the name $nodename.
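A minimal sketch of creating a document from scratch with the methods described so far (createDocument(), createElement(), setDocumentElement() and toString()). The element names are made up for illustration, and appendText() and appendChild() come from the element and node classes rather than from the Document class.

```perl
use strict;
use warnings;
use XML::LibXML;

# Create an empty document with an explicit version and encoding.
my $doc = XML::LibXML::Document->createDocument( '1.0', 'UTF-8' );

# Elements created through the document are bound to it,
# but they are not part of the tree until they are attached.
my $root = $doc->createElement('inventory');
$doc->setDocumentElement($root);

my $item = $doc->createElement('item');
$item->appendText('widget');
$root->appendChild($item);

# Format level 1 asks libxml2 to add indentation.
my $xml = $doc->toString(1);
print $xml;
```

Because toString() on a document returns a byte string in the document encoding, the result can be printed to a raw filehandle without an additional encoding layer.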
createElementNS $element = $dom->createElementNS( $namespaceURI, $nodename ); This function creates a new Element Node bound to the DOM with the name $nodename and placed in the given namespace. createTextNode $text = $dom->createTextNode( $content_text ); The equivalent of createElement, but it creates a Text Node bound to the DOM. createComment $comment = $dom->createComment( $comment_text ); The equivalent of createElement, but it creates a Comment Node bound to the DOM. createAttribute $attrnode = $doc->createAttribute($name [,$value]); Creates a new Attribute node. createAttributeNS $attrnode = $doc->createAttributeNS( $namespaceURI, $name [,$value] ); Creates an Attribute bound to a namespace. createDocumentFragment $fragment = $doc->createDocumentFragment(); This function creates a DocumentFragment. createCDATASection $cdata = $dom->createCDATASection( $cdata_content ); Similar to createTextNode and createComment, this function creates a CDataSection bound to the current DOM. createProcessingInstruction my $pi = $doc->createProcessingInstruction( $target, $data ); Creates a processing instruction node. Since this method name is quite long one may use its short form createPI(). createEntityReference my $entref = $doc->createEntityReference($refname); If a document has a DTD specified, one can create entity references by using this function. If one wants to add an entity reference to the document, this reference has to be created by this function. An entity reference is unique to a document and cannot be passed to other documents as other nodes can be passed.
NOTE: A text content containing something that looks like an entity reference will not be expanded to a real entity reference unless it is a predefined entity: my $string = "&foo;"; $some_element->appendText( $string ); print $some_element->textContent; # prints "&amp;foo;" createInternalSubset $dtd = $document->createInternalSubset( $rootnode, $public, $system); This function creates and adds an internal subset to the given document. Because the function automatically adds the DTD to the document there is no need to add the created node explicitly to the document. my $document = XML::LibXML::Document->new(); my $dtd = $document->createInternalSubset( "foo", undef, "foo.dtd" ); will result in the following XML document: <?xml version="1.0"?> <!DOCTYPE foo SYSTEM "foo.dtd"> By setting the public parameter it is possible to set PUBLIC DTDs to a given document. So my $document = XML::LibXML::Document->new(); my $dtd = $document->createInternalSubset( "foo", "-//FOO//DTD FOO 0.1//EN", undef ); will cause the following declaration to be created on the document: <?xml version="1.0"?> <!DOCTYPE foo PUBLIC "-//FOO//DTD FOO 0.1//EN"> createExternalSubset $dtd = $document->createExternalSubset( $rootnode_name, $publicId, $systemId); This function is similar to createInternalSubset() but this DTD is considered to be external and is therefore not added to the document itself. Nevertheless it can be used for validation purposes. importNode $document->importNode( $node ); If a node is not part of a document, it can be imported to another document. As specified in the DOM Level 2 Specification, the Node will not be altered or removed from its original document ($node->cloneNode(1) will get called implicitly). NOTE: Don't try to use importNode() to import sub-trees that contain an entity reference - even if the entity reference is the root node of the sub-tree. This will cause serious problems in your program. This is a limitation of libxml2 and not of XML::LibXML itself.
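The copy semantics of importNode() can be illustrated with a short sketch. The two small documents and the element names used here are assumptions made up for the example; load_xml() is used only as a convenient way to obtain them.

```perl
use strict;
use warnings;
use XML::LibXML;

# Two independent documents (contents are illustrative).
my $src = XML::LibXML->load_xml( string => '<a><b id="1">text</b></a>' );
my $dst = XML::LibXML->load_xml( string => '<target/>' );

# Pick the node to copy from the source document.
my ($b_node) = $src->documentElement->getChildrenByTagName('b');

# importNode returns a deep copy owned by $dst; $src is left untouched.
my $copy = $dst->importNode($b_node);

# The copy still has to be inserted into the target tree explicitly.
$dst->documentElement->appendChild($copy);

print $src->documentElement->toString, "\n";
print $dst->documentElement->toString, "\n";
```

After the import both documents contain a b element, because the original node stays where it was.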
adoptNode $document->adoptNode( $node ); A node that belongs to one document can be moved to another document with this function. As specified in the DOM Level 3 Specification, the Node will not be altered but it will be removed from its original document. After a document has adopted a node, the node, its attributes and all its descendants belong to the new document. Because the node does not belong to the old document, it will be unlinked from its old location first. NOTE: Don't try to use adoptNode() to import sub-trees that contain entity references - even if the entity reference is the root node of the sub-tree. This will cause serious problems in your program. This is a limitation of libxml2 and not of XML::LibXML itself. externalSubset my $dtd = $doc->externalSubset; If a document has an external subset defined it will be returned by this function. NOTE: Dtd nodes are no ordinary nodes in libxml2. The support for these nodes in XML::LibXML is still limited. In particular one may not want to use common node functions on doctype declaration nodes! internalSubset my $dtd = $doc->internalSubset; If a document has an internal subset defined it will be returned by this function. NOTE: Dtd nodes are no ordinary nodes in libxml2. The support for these nodes in XML::LibXML is still limited. In particular one may not want to use common node functions on doctype declaration nodes! setExternalSubset $doc->setExternalSubset($dtd); EXPERIMENTAL! This method sets a DTD node as an external subset of the given document. setInternalSubset $doc->setInternalSubset($dtd); EXPERIMENTAL! This method sets a DTD node as an internal subset of the given document. removeExternalSubset my $dtd = $doc->removeExternalSubset(); EXPERIMENTAL! If a document has an external subset defined it can be removed from the document by using this function. The removed dtd node will be returned. removeInternalSubset my $dtd = $doc->removeInternalSubset(); EXPERIMENTAL!
If a document has an internal subset defined it can be removed from the document by using this function. The removed dtd node will be returned. getElementsByTagName my @nodelist = $doc->getElementsByTagName($tagname); Implements the DOM Level 2 function. In SCALAR context this function returns an XML::LibXML::NodeList object. getElementsByTagNameNS my @nodelist = $doc->getElementsByTagNameNS($nsURI,$tagname); Implements the DOM Level 2 function. In SCALAR context this function returns an XML::LibXML::NodeList object. getElementsByLocalName my @nodelist = $doc->getElementsByLocalName($localname); This allows the fetching of all nodes from a given document with the given local name. In SCALAR context this function returns an XML::LibXML::NodeList object. getElementById my $node = $doc->getElementById($id); Returns the element that has an ID attribute with the given value. If no such element exists, this returns undef. Note: the ID of an element may change while manipulating the document. For documents with a DTD, the information about ID attributes is only available if DTD loading/validation has been requested. For HTML documents parsed with the HTML parser, ID detection is done automatically. In XML documents, all "xml:id" attributes are considered to be of type ID. You can test ID-ness of an attribute node with $attr->isId(). In versions 1.59 and earlier this method was called getElementsById() (plural) by mistake. Starting from 1.60 this name is maintained as an alias only for backward compatibility. indexElements $dom->indexElements(); This function causes libxml2 to stamp all elements in a document with their document position index, which considerably speeds up XPath queries for large documents. It should only be used with static documents that won't be further changed by any DOM methods, because once a document is indexed, XPath will always prefer the index to other methods of determining the document order of nodes.
XPath could therefore return improperly ordered node-lists when applied on a document that has been changed after being indexed. It is of course possible to use this method to re-index a modified document before using it with XPath again. This function is not a part of the DOM specification. This function returns the number of elements indexed, -1 if an error occurred, or -2 if this feature is not available in the running libxml2. Abstract Base Class of XML::LibXML Nodes XML::LibXML::Node Synopsis use XML::LibXML; Description XML::LibXML::Node defines functions that are common to all Node Types. An XML::LibXML::Node should never be created standalone, but as an instance of a high level class such as XML::LibXML::Element or XML::LibXML::Text. The class itself should provide only common functionality. In XML::LibXML each node is part either of a document or a document-fragment. Because of this there is no node without a parent. This may cause confusion with "unbound" nodes. Methods Many functions listed here are extensively documented in the DOM Level 3 specification. Please refer to the specification for extensive documentation. nodeName $name = $node->nodeName; Returns the node's name. This function is aware of namespaces and returns the full name of the current node (prefix:localname). Since 1.62 this function also returns the correct DOM names for node types with constant names, namely: #text, #cdata-section, #comment, #document, #document-fragment. setNodeName $node->setNodeName( $newName ); In very limited situations, it is useful to change a node's name. In the DOM specification this should throw an error. This function is aware of namespaces. isSameNode $bool = $node->isSameNode( $other_node ); Returns TRUE (1) if the given nodes refer to the same node structure, otherwise FALSE (0) is returned. isEqual $bool = $node->isEqual( $other_node ); Deprecated version of isSameNode().
NOTE: isEqual will change behaviour to follow the DOM specification. unique_key $num = $node->unique_key; This function is not specified for any DOM level. It returns a key guaranteed to be unique for this node, and to always be the same value for this node. In other words, two node objects return the same key if and only if isSameNode indicates that they are the same node. The returned key value is useful as a key in hashes. nodeValue $content = $node->nodeValue; If the node has any content (such as stored in a text node) it can be requested through this function. NOTE: Element Nodes have no content by definition. To get the text value of an Element use textContent() instead! textContent $content = $node->textContent; This function returns the content of all text nodes in the descendants of the given node as specified in DOM. nodeType $type = $node->nodeType; Returns a numeric value representing the node type of this node. The module XML::LibXML by default exports constants for the node types (see the EXPORT section in the manual page). unbindNode $node->unbindNode(); Unbinds the Node from its siblings and Parent, but not from the Document it belongs to. If the node is not inserted into the DOM afterwards, it will be lost after the program terminates. From a low level view, the unbound node is stripped from the context it is in and inserted into a (hidden) document-fragment. removeChild $childnode = $node->removeChild( $childnode ); This will unbind the Child Node from its parent $node. The function returns the unbound node. If $childnode is not a child of the given Node the function will fail. replaceChild $oldnode = $node->replaceChild( $newNode, $oldNode ); Replaces the $oldNode with the $newNode. The $oldNode will be unbound from the Node. This function differs from the DOM L2 specification in that a new node which is not part of the document will be imported first.
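The three ways of detaching a node described above (removeChild, replaceChild and unbindNode) can be compared side by side. This is only a sketch; the list/item/entry document is invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# A small illustrative document with three children.
my $doc = XML::LibXML->load_xml(
    string => '<list><item>a</item><item>b</item><item>c</item></list>' );
my $list  = $doc->documentElement;
my @items = $list->getChildrenByTagName('item');

# removeChild is called on the parent and returns the unbound child.
my $removed = $list->removeChild( $items[0] );

# replaceChild swaps in a new node and returns the old one.
my $new = $doc->createElement('entry');
$new->appendText('B');
my $old = $list->replaceChild( $new, $items[1] );

# unbindNode is called on the node itself; no parent is needed.
$items[2]->unbindNode;

print $list->toString, "\n";
print $removed->textContent, " ", $old->textContent, "\n";
```

The detached nodes remain usable Perl objects, so they can be inspected or inserted elsewhere later.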
replaceNode $node->replaceNode($newNode); This function is very similar to replaceChild(), but it replaces the node itself rather than a childnode. This is useful if a node found by any XPath function should be replaced. appendChild $childnode = $node->appendChild( $childnode ); The function will add the $childnode to the end of $node's children. The function should fail if the new childnode is already a child of $node. This function differs from the DOM L2 specification in that a new node which is not part of the document will be imported first. addChild $childnode = $node->addChild( $childnode ); As an alternative to appendChild() one can use the addChild() function. This function is a bit faster, because it avoids all DOM conformity checks. Therefore this function is quite useful if one builds XML documents in memory where the order and ownership (ownerDocument) is assured. addChild() uses libxml2's own xmlAddChild() function. Thus it has to be used with extra care: If a text node is added to a node and the node itself or its last childnode is a text node as well, the node to add will be merged with the one already available. The current node will be removed from memory after this action. Because perl is not aware of this action, the perl instance is still available. XML::LibXML will catch the loss of a node and refuse to run any function called on that node. my $t1 = $doc->createTextNode( "foo" ); my $t2 = $doc->createTextNode( "bar" ); $t1->addChild( $t2 ); # is OK my $val = $t2->nodeValue(); # will fail, script dies Also addChild() will not check if the added node belongs to the same document as the node it will be added to. This could lead to inconsistent documents and in worse cases even to memory violations, if one does not keep track of this issue. Although this sounds like a lot of trouble, addChild() is useful if a document is built from a stream, such as happens sometimes in SAX handlers or filters.
If you are not sure about the source of your nodes, you had better stay with appendChild(), because this function is more user friendly in the sense of being more error tolerant. addNewChild $node = $parent->addNewChild( $nsURI, $name ); Similar to addChild(), this function uses low level libxml2 functionality to provide a faster interface for DOM building. addNewChild() uses xmlNewChild() to create a new node on a given parent element. addNewChild() has two parameters $nsURI and $name, where $nsURI is an (optional) namespace URI. $name is the fully qualified element name; addNewChild() will determine the correct prefix if necessary. The function returns the newly created node. This function is very useful for DOM building, where a created node can be directly associated with its parent. NOTE: this function is not part of the DOM specification and its use will limit your code to XML::LibXML. addSibling $node->addSibling($newNode); addSibling() allows adding an additional node to the end of a nodelist, defined by the given node. cloneNode $newnode = $node->cloneNode( $deep ); cloneNode creates a copy of $node. When $deep is set to 1 (true) the function will copy all child nodes as well. If $deep is 0 only the current node will be copied. Note that in the case of an element, attributes are copied even if $deep is 0. Note that the behavior of this function for $deep=0 has changed in 1.62 in order to be consistent with the DOM spec (in older versions attributes and namespace information were not copied for elements). parentNode $parentnode = $node->parentNode; Returns the Parent Node of the current node. nextSibling $nextnode = $node->nextSibling(); Returns the next sibling if any. nextNonBlankSibling $nextnode = $node->nextNonBlankSibling(); Returns the next non-blank sibling if any (a node is blank if it is a Text or CDATA node consisting of whitespace only). This method is not defined by DOM.
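The difference between shallow and deep cloning, and between nextSibling and nextNonBlankSibling, may be easier to see in a small sketch. The document below is made up for illustration and is parsed with default options, so whitespace between elements is kept as text nodes.

```perl
use strict;
use warnings;
use XML::LibXML;

# Indented input: the whitespace between elements becomes text nodes.
my $doc = XML::LibXML->load_xml( string => <<'XML' );
<r>
  <first x="1"><inner/></first>
  <second/>
</r>
XML

my ($first) = $doc->documentElement->getChildrenByTagName('first');

# Shallow clone: the element and its attributes, but no children.
my $shallow = $first->cloneNode(0);

# Deep clone: the element, its attributes and all descendants.
my $deep = $first->cloneNode(1);

# nextSibling sees the whitespace text node, nextNonBlankSibling skips it.
my $next          = $first->nextSibling;
my $next_nonblank = $first->nextNonBlankSibling;

print $shallow->toString, "\n";
print $deep->toString,    "\n";
print $next->nodeName, " / ", $next_nonblank->nodeName, "\n";
```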
previousSibling $prevnode = $node->previousSibling(); Analogous to getNextSibling, this function returns the previous sibling if any. previousNonBlankSibling $prevnode = $node->previousNonBlankSibling(); Returns the previous non-blank sibling if any (a node is blank if it is a Text or CDATA node consisting of whitespace only). This method is not defined by DOM. hasChildNodes $boolean = $node->hasChildNodes(); If the current node has child nodes this function returns TRUE (1), otherwise it returns FALSE (0, not undef). firstChild $childnode = $node->firstChild; If a node has child nodes this function will return the first node in the child list. lastChild $childnode = $node->lastChild; If the $node has child nodes this function returns the last child node. ownerDocument $documentnode = $node->ownerDocument; Through this function it is always possible to access the document the current node is bound to. getOwner $node = $node->getOwner; This function returns the node the current node is associated with. In most cases this will be a document node or a document fragment node. setOwnerDocument $node->setOwnerDocument( $doc ); This function binds a node to another DOM. This method unbinds the node first, if it is already bound to another document. This function is the opposite calling of XML::LibXML::Document's adoptNode() function. Because of this it has the same limitations with Entity References as adoptNode(). insertBefore $node->insertBefore( $newNode, $refNode ); The method inserts $newNode before $refNode. If $refNode is undefined, $newNode will be set as the new last child of the parent node. This function differs from the DOM L2 specification in that a new node which is not part of the document will be imported first, automatically.
$refNode has to be passed to the function even if it is undefined: $node->insertBefore( $newNode, undef ); # the same as $node->appendChild( $newNode ); $node->insertBefore( $newNode ); # wrong Note that the reference node has to be a direct child of the node the function is called on. Also, $newNode is not allowed to be an ancestor of the new parent node. insertAfter $node->insertAfter( $newNode, $refNode ); The method inserts $newNode after $refNode. If $refNode is undefined, $newNode will be set as the new last child of the parent node. Note that $refNode has to be passed explicitly even if it is undef. findnodes @nodes = $node->findnodes( $xpath_expression ); findnodes evaluates the xpath expression (XPath 1.0) on the current node and returns the resulting node set as an array. In scalar context, returns an XML::LibXML::NodeList object. The xpath expression can be passed either as a string, or as an XML::LibXML::XPathExpression object. NOTE ON NAMESPACES AND XPATH: A common mistake about XPath is to assume that node tests consisting of an element name with no prefix match elements in the default namespace. This assumption is wrong - by XPath specification, such node tests can only match elements that are in no (i.e. null) namespace. So, for example, one cannot match the root element of an XHTML document with $node->find('/html') since '/html' would only match if the root element <html> had no namespace, but all XHTML elements belong to the namespace http://www.w3.org/1999/xhtml. (Note that xmlns="..." namespace declarations can also be specified in a DTD, which makes the situation even worse, since the XML document looks as if there was no default namespace). There are several possible ways to deal with namespaces in XPath: The recommended way is to use the XML::LibXML::XPathContext module to define an explicit context for XPath evaluation, in which a document independent prefix-to-namespace mapping can be defined.
For example: my $xpc = XML::LibXML::XPathContext->new; $xpc->registerNs('x', 'http://www.w3.org/1999/xhtml'); $xpc->find('/x:html',$node); Another possibility is to use prefixes declared in the queried document (if known). If the document declares a prefix for the namespace in question (and the context node is in the scope of the declaration), XML::LibXML allows you to use the prefix in the XPath expression, e.g.: $node->find('/x:html'); See also XML::LibXML::XPathContext->findnodes. find $result = $node->find( $xpath ); find evaluates the XPath 1.0 expression using the current node as the context of the expression, and returns the result depending on what type of result the XPath expression had. For example, the XPath "1 * 3 + 52" results in an XML::LibXML::Number object being returned. Other expressions might return an XML::LibXML::Boolean object, or an XML::LibXML::Literal object (a string). Each of those objects uses Perl's overload feature to "do the right thing" in different contexts. The xpath expression can be passed either as a string, or as an XML::LibXML::XPathExpression object. See also XML::LibXML::XPathContext->find. findvalue print $node->findvalue( $xpath ); findvalue is exactly equivalent to: $node->find( $xpath )->to_literal; That is, it returns the literal value of the results. This enables you to ensure that you get a string back from your search, allowing certain shortcuts. This could be used as the equivalent of XSLT's <xsl:value-of select="some_xpath"/>. See also XML::LibXML::XPathContext->findvalue. The xpath expression can be passed either as a string, or as an XML::LibXML::XPathExpression object. exists $bool = $node->exists( $xpath_expression ); This method behaves like findnodes, except that it only returns a boolean value (1 if the expression matches a node, 0 otherwise) and may be faster than findnodes, because the XPath evaluation may stop early on the first match (this is true for libxml2 >= 2.6.27).
For XPath expressions that do not return a node-set, the method returns true if the returned value is a non-zero number or a non-empty string. childNodes @childnodes = $node->childNodes(); childNodes implements a more intuitive interface to the childnodes of the current node. It enables you to pass all children directly to a map or grep. If this function is called in scalar context, an XML::LibXML::NodeList object will be returned. nonBlankChildNodes @childnodes = $node->nonBlankChildNodes(); This is like childNodes, but returns only non-blank nodes (where a node is blank if it is a Text or CDATA node consisting of whitespace only). This method is not defined by DOM. toString $xmlstring = $node->toString($format,$docencoding); This method is similar to the method toString of an XML::LibXML::Document but for a single node. It returns a string consisting of the XML serialization of the given node and all its descendants. Unlike XML::LibXML::Document::toString, in this case the resulting string is by default a character string (UTF-8 encoded with UTF8 flag on). An optional flag $format controls indentation, as in XML::LibXML::Document::toString. If the second optional $docencoding flag is true, the result will be a byte string in the document encoding (see XML::LibXML::Document::actualEncoding). toStringC14N $c14nstring = $node->toStringC14N(); $c14nstring = $node->toStringC14N($with_comments, $xpath_expression , $xpath_context); The function is similar to toString(). Instead of simply serializing the document tree, it transforms it as it is specified in the XML-C14N Specification (see http://www.w3.org/TR/xml-c14n). Such transformation is known as canonization. If $with_comments is 0 or not defined, the result-document will not contain any comments that exist in the original document. To include comments into the canonized document, $with_comments has to be set to 1. The parameter $xpath_expression defines the nodeset of nodes that should be visible in the resulting document.
This can be used to filter out some nodes. One has to note, that only the nodes that are part of the nodeset, will be included into the result-document. Their child-nodes will not exist in the resulting document, unless they are part of the nodeset defined by the xpath expression. If $xpath_expression is omitted or empty, toStringC14N() will include all nodes in the given sub-tree, using the following XPath expressions: with comments (. | .//node() | .//@* | .//namespace::*) and without comments (. | .//node() | .//@* | .//namespace::*)[not(self::comment())] An optional parameter $xpath_context can be used to pass an object defining the context for evaluation of $xpath_expression. This is useful for mapping namespace prefixes used in the XPath expression to namespace URIs. Note, however, that $node will be used as the context node for the evaluation, not the context node of $xpath_context! toStringC14N_v1_1 $c14nstring = $node->toStringC14N_v1_1(); $c14nstring = $node->toStringC14N_v1_1($with_comments, $xpath_expression , $xpath_context); This function behaves like toStringC14N() except that it uses the "XML_C14N_1_1" constant for canonicalising using the "C14N 1.1 spec". toStringEC14N $ec14nstring = $node->toStringEC14N(); $ec14nstring = $node->toStringEC14N($with_comments, $xpath_expression, $inclusive_prefix_list); $ec14nstring = $node->toStringEC14N($with_comments, $xpath_expression, $xpath_context, $inclusive_prefix_list); The function is similar to toStringC14N() but follows the XML-EXC-C14N Specification (see http://www.w3.org/TR/xml-exc-c14n) for exclusive canonization of XML. The arguments $with_comments, $xpath_expression, $xpath_context are as in toStringC14N(). An ARRAY reference can be passed as the last argument $inclusive_prefix_list, listing namespace prefixes that are to be handled in the manner described by the Canonical XML Recommendation (i.e. preserved in the output even if the namespace is not used). C.f. the spec for details. 
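To make the effect of canonization and of the $with_comments flag more concrete, here is a minimal sketch that canonizes a deliberately untidy document. The input is invented for the example; no XPath filter or inclusive prefix list is used.

```perl
use strict;
use warnings;
use XML::LibXML;

# Irregular spacing, unsorted attributes, a comment and an empty element.
my $doc = XML::LibXML->load_xml(
    string => '<root  b="2"   a="1"><!-- note --><child/></root>' );

# Without arguments comments are dropped.
my $plain = $doc->toStringC14N();

# With $with_comments set to 1 the comment is kept.
my $with_comments = $doc->toStringC14N(1);

print $plain,         "\n";
print $with_comments, "\n";
```

Canonical XML sorts the attributes and writes empty elements as start/end tag pairs, so two documents that differ only in such details serialize to the same string.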
serialize $str = $doc->serialize($format); An alias for toString(). This function name was added to be more consistent with libxml2. serialize_c14n An alias for toStringC14N(). serialize_exc_c14n An alias for toStringEC14N(). localname $localname = $node->localname; Returns the local name of a tag. This is the part after the colon. prefix $nameprefix = $node->prefix; Returns the prefix of a tag. This is the part before the colon. namespaceURI $uri = $node->namespaceURI(); Returns the URI of the current namespace. hasAttributes $boolean = $node->hasAttributes(); Returns 1 (TRUE) if the current node has any attributes set, otherwise 0 (FALSE) is returned. attributes @attributelist = $node->attributes(); This function returns all attributes and namespace declarations assigned to the given node. Because XML::LibXML does not implement namespace declarations and attributes the same way, it is required to test what kind of node is handled while accessing the function's result. If this function is called in array context the attribute nodes are returned as an array. In scalar context, the function will return an XML::LibXML::NamedNodeMap object. lookupNamespaceURI $URI = $node->lookupNamespaceURI( $prefix ); Find a namespace URI by its prefix starting at the current node. lookupNamespacePrefix $prefix = $node->lookupNamespacePrefix( $URI ); Find a namespace prefix by its URI starting at the current node. NOTE: Only the namespace URIs are meant to be unique. The prefix is only document related. Also the document might have more than a single prefix defined for a namespace. normalize $node->normalize; This function normalizes adjacent text nodes. This function is not as strict as libxml2's xmlTextMerge() function, since it will not free a node that is still referenced by the perl layer. getNamespaces @nslist = $node->getNamespaces; If a node has any namespaces defined, this function will return these namespaces.
Note that this will not return all namespaces that are in scope, but only the ones declared explicitly for that node. Although getNamespaces is available for all nodes, it only makes sense if used with element nodes. removeChildNodes $node->removeChildNodes(); This function is not specified for any DOM level: It removes all childnodes from a node in a single step. Unlike the libxml2 function itself (xmlFreeNodeList), this function will not immediately remove the nodes from memory. This saves one from getting memory violations if there are nodes still referred to from the Perl level. baseURI () $strURI = $node->baseURI(); Searches for the base URL of the node. The method should work on both XML and HTML documents even if base mechanisms for these are completely different. It returns the base as defined in RFC 2396 sections "5.1.1. Base URI within Document Content" and "5.1.2. Base URI from the Encapsulating Entity". However, it does not return the document base (5.1.3); use the method URI of XML::LibXML::Document for this. setBaseURI ($strURI) $node->setBaseURI($strURI); This method only does something useful for an element node in an XML document. It sets the xml:base attribute on the node to $strURI, which effectively sets the base URI of the node to the same value. Note: For HTML documents this behaves as if the document was XML, which may not be desired, since it does not effectively set the base URI of the node. See RFC 2396 appendix D for an example of how base URI can be specified in HTML. nodePath $node->nodePath(); This function is not specified for any DOM level: It returns a canonical structure based XPath for a given node. line_number $lineno = $node->line_number(); This function returns the line number where the tag was found during parsing. If a node is added to the document the line number is 0. Problems may occur if a node from one document is passed to another one.
IMPORTANT: Due to limitations in the libxml2 library, line numbers greater than 65535 will be returned as 65535. Please see http://bugzilla.gnome.org/show_bug.cgi?id=325533 for more details. Note: line_number() is special to XML::LibXML and not part of the DOM specification. If the line_numbers flag of the parser was not activated before parsing, line_number() will always return 0. XML::LibXML Class for Element Nodes XML::LibXML::Element Synopsis use XML::LibXML; # Only methods specific to Element nodes are listed here, # see the XML::LibXML::Node manpage for other methods Methods The class inherits from XML::LibXML::Node. The documentation for inherited methods is not listed here. Many functions listed here are extensively documented in the DOM Level 3 specification. Please refer to the specification for extensive documentation. new $node = XML::LibXML::Element->new( $name ); This function creates a new node unbound to any DOM. setAttribute $node->setAttribute( $aname, $avalue ); This method sets or replaces the node's attribute $aname to the value $avalue. setAttributeNS $node->setAttributeNS( $nsURI, $aname, $avalue ); Namespace-aware version of setAttribute, where $nsURI is a namespace URI, $aname is a qualified name, and $avalue is the value. The namespace URI may be null (empty or undefined) in order to create an attribute which has no namespace. The current implementation differs from DOM in the following aspects: If an attribute with the same local name and namespace URI already exists on the element, but its prefix differs from the prefix of $aname, then this function is supposed to change the prefix (regardless of namespace declarations and possible collisions). However, the current implementation does rather the opposite. If a prefix is declared for the namespace URI in the scope of the attribute, then the already declared prefix is used, disregarding the prefix specified in $aname.
If no prefix is declared for the namespace, the function tries to declare the prefix specified in $aname and dies if the prefix is already taken by some other namespace. According to the DOM Level 2 specification, this method can also be used to create or modify special attributes used for declaring XML namespaces (which belong to the namespace "http://www.w3.org/2000/xmlns/" and have prefix or name "xmlns"). This should work since version 1.61, but again the implementation differs from the DOM specification in the following: if a declaration of the same namespace prefix already exists on the element, then changing its value via this method automatically changes the namespace of all elements and attributes in its scope. This is because in libxml2 the namespace URI of an element is not static but is computed from a pointer to a namespace declaration attribute. getAttribute $avalue = $node->getAttribute( $aname ); If $node has an attribute with the name $aname, the value of this attribute will be returned. getAttributeNS $avalue = $node->getAttributeNS( $nsURI, $aname ); Retrieves an attribute value by local name and namespace URI. getAttributeNode $attrnode = $node->getAttributeNode( $aname ); Retrieves an attribute node by name. If no attribute with a given name exists, undef is returned. getAttributeNodeNS $attrnode = $node->getAttributeNodeNS( $namespaceURI, $aname ); Retrieves an attribute node by local name and namespace URI. If no attribute with a given localname and namespace exists, undef is returned. removeAttribute $node->removeAttribute( $aname ); The method removes the attribute $aname from the node's attribute list, if the attribute can be found. removeAttributeNS $node->removeAttributeNS( $nsURI, $aname ); Namespace version of removeAttribute. hasAttribute $boolean = $node->hasAttribute( $aname ); This function tests if the named attribute is set for the node. If the attribute is specified, TRUE (1) will be returned, otherwise the return value is FALSE (0).
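The attribute accessors described above can be exercised together in a short sketch. The img element, the attribute values and the choice of the XLink namespace are assumptions made for illustration only.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML::Document->new( '1.0', 'UTF-8' );
my $el  = $doc->createElement('img');    # illustrative element name
$doc->setDocumentElement($el);

my $xlink = 'http://www.w3.org/1999/xlink';

# Plain and namespace-aware setters.
$el->setAttribute( 'src', 'a.png' );
$el->setAttributeNS( $xlink, 'xlink:href', 'b.png' );

# The NS getter takes the namespace URI and the local name.
my $src  = $el->getAttribute('src');
my $href = $el->getAttributeNS( $xlink, 'href' );

# A missing attribute node is reported as undef.
my $missing = $el->getAttributeNode('alt');

# Remove an attribute and check that it is gone.
$el->removeAttribute('src');
print "src removed\n" unless $el->hasAttribute('src');

print $el->toString, "\n";
```

Since no prefix was declared for the namespace beforehand, setAttributeNS declares the xlink prefix on the element itself, as described in the notes on setAttributeNS.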
hasAttributeNS $boolean = $node->hasAttributeNS( $nsURI, $aname ); Namespace version of hasAttribute. getChildrenByTagName @nodes = $node->getChildrenByTagName($tagname); The function gives direct access to all child elements of the current node with a given tagname, where tagname is a qualified name, that is, in case of namespace usage it may consist of a prefix and local name. This function makes things a lot easier if one needs to handle big data sets. A special tagname '*' can be used to match any name. If this function is called in SCALAR context, it returns the number of elements found. getChildrenByTagNameNS @nodes = $node->getChildrenByTagNameNS($nsURI,$tagname); Namespace version of getChildrenByTagName. A special nsURI '*' matches any namespace URI, in which case the function behaves just like getChildrenByLocalName. If this function is called in SCALAR context, it returns the number of elements found. getChildrenByLocalName @nodes = $node->getChildrenByLocalName($localname); The function gives direct access to all child elements of the current node with a given local name. It makes things a lot easier if one needs to handle big data sets. A special localname '*' can be used to match any local name. If this function is called in SCALAR context, it returns the number of elements found. getElementsByTagName @nodes = $node->getElementsByTagName($tagname); This function is part of the DOM spec. It fetches all descendants of a node with a given tagname, where tagname is a qualified name, that is, in case of namespace usage it may consist of a prefix and local name. A special tagname '*' can be used to match any tag name. In SCALAR context this function returns an XML::LibXML::NodeList object. getElementsByTagNameNS @nodes = $node->getElementsByTagNameNS($nsURI,$localname); Namespace version of getElementsByTagName as found in the DOM spec. A special localname '*' can be used to match any local name and nsURI '*' can be used to match any namespace URI.
In SCALAR context this function returns an XML::LibXML::NodeList object.

getElementsByLocalName
    @nodes = $node->getElementsByLocalName($localname);
This function is not found in the DOM specification. It is a mix of getElementsByTagName and getElementsByTagNameNS. It fetches all elements matching the given local name, which allows one to select elements with the same local name across namespace borders. In SCALAR context this function returns an XML::LibXML::NodeList object.

appendWellBalancedChunk
    $node->appendWellBalancedChunk( $chunk );
Sometimes it is necessary to append an XML tree given as a string to a node. appendWellBalancedChunk does this, but only if the string is well-balanced. Note that appendWellBalancedChunk() is only kept for compatibility reasons. Implicitly it uses
    my $fragment = $parser->parse_balanced_chunk( $chunk );
    $node->appendChild( $fragment );
This form is more explicit and makes it easier to control the flow of a script.

appendText
    $node->appendText( $PCDATA );
Alias for appendTextNode().

appendTextNode
    $node->appendTextNode( $PCDATA );
This wrapper function lets you add a string directly to an element node.

appendTextChild
    $node->appendTextChild( $childname , $PCDATA );
Somewhat similar to appendTextNode: it lets you add a child element that contains only a text node by specifying the element name and the text content.

setNamespace
    $node->setNamespace( $nsURI , $nsPrefix, $activate );
setNamespace() allows one to apply a namespace to an element. The function takes three parameters: the namespace URI, which is required, and two optional values: the prefix, which is the namespace prefix as it should be used in child elements or attributes, and the additional activate parameter. If prefix is not given, undefined or empty, this function tries to create a declaration of the default namespace.
The activate parameter is most useful: if this parameter is set to FALSE (0), a new namespace declaration is simply added to the element while the element's namespace itself is not altered. By default, however, activate is set to TRUE (1). In this case the namespace is used as the node's effective namespace. This means the namespace prefix is added to the node name and, if there was a namespace already active for the node, it will be replaced (but its declaration is not removed from the document). A new namespace declaration is only created if necessary (that is, if the element is already in the scope of a namespace declaration associating the prefix with the namespace URI, then this declaration is reused).

The following example may clarify this:
    my $e1 = $doc->createElement("bar");
    $e1->setNamespace("http://foobar.org", "foo");
results in
    <foo:bar xmlns:foo="http://foobar.org"/>
while
    my $e2 = $doc->createElement("bar");
    $e2->setNamespace("http://foobar.org", "foo", 0);
results only in
    <bar xmlns:foo="http://foobar.org"/>
By using $activate == 0 it is possible to create multiple namespace declarations on a single element.

The function fails if it is required to create a declaration associating the prefix with the namespace URI but the element already carries a declaration with the same prefix but a different namespace URI.

setNamespaceDeclURI
    $node->setNamespaceDeclURI( $nsPrefix, $newURI );
EXPERIMENTAL IN 1.61 !
This function directly manipulates an existing namespace declaration on an element. It takes two parameters: the prefix by which it looks up the namespace declaration and a new namespace URI which replaces its previous value. It returns 1 if the namespace declaration was found and changed, 0 otherwise. All elements and attributes (even those previously unbound from the document) for which the namespace declaration determines their namespace belong to the new namespace after the change.
If the new URI is undef or empty, the nodes have no namespace and no prefix after the change. Namespace declarations once nulled in this way no longer appear in the serialized output (but do remain in the document for internal integrity of libxml2 data structures). This function is NOT part of any DOM API.

setNamespaceDeclPrefix
    $node->setNamespaceDeclPrefix( $oldPrefix, $newPrefix );
EXPERIMENTAL IN 1.61 !
This function directly manipulates an existing namespace declaration on an element. It takes two parameters: the old prefix by which it looks up the namespace declaration and a new prefix which is to replace the old one. The function dies with an error if the element is in the scope of another declaration whose prefix equals the new prefix, or if the change would result in a declaration with a non-empty prefix but an empty namespace URI. Otherwise, it returns 1 if the namespace declaration was found and changed and 0 if it was not found. All elements and attributes (even those previously unbound from the document) for which the namespace declaration determines their namespace change their prefix to the new value.

If the new prefix is undef or empty, the namespace declaration becomes a declaration of a default namespace. The corresponding nodes drop their namespace prefix (but remain in the, now default, namespace). In this case the function fails if the containing element is in the scope of another default namespace declaration. This function is NOT part of any DOM API.

Overloading
XML::LibXML::Element overloads hash dereferencing to provide access to the element's attributes. For non-namespaced attributes, the attribute name is the hash key, and the attribute value is the hash value. For namespaced attributes, the hash key is qualified with the namespace URI, using Clark notation. Perl's "tied hash" feature is used, which means that the hash gives you read-write access to the element's attributes.
For more information, see XML::LibXML::AttributeHash.

XML::LibXML Class for Text Nodes
XML::LibXML::Text

Synopsis
    use XML::LibXML;
    # Only methods specific to Text nodes are listed here,
    # see the XML::LibXML::Node manpage for other methods

Description
Unlike the DOM specification, XML::LibXML implements the text node as the base class of all character data nodes. Therefore there exists no CharacterData class. This allows one to apply methods of text nodes also to Comments and CDATA-sections.

Methods
The class inherits from XML::LibXML::Node. The documentation for inherited methods is not listed here. Many functions listed here are extensively documented in the DOM Level 3 specification. Please refer to the specification for extensive documentation.

new
    $text = XML::LibXML::Text->new( $content );
The constructor of the class. It creates an unbound text node.

data
    $nodedata = $text->data;
Although there exists the nodeValue attribute in the Node class, the DOM specification defines data as a separate attribute. XML::LibXML implements these two attributes not as different attributes, but as aliases, as libxml2 does. Therefore $text->data; and $text->nodeValue; will have the same result and are not different entities.

setData($string)
    $text->setData( $text_content );
This function sets or replaces the text content of a node. The node has to be of the type "text", "cdata" or "comment".

substringData($offset,$length)
    $text->substringData($offset, $length);
Extracts a range of data from the node. (DOM Spec) This function takes the two parameters $offset and $length and returns the sub-string, if available. If the node contains no data or $offset refers to a non-existing string index, this function will return undef. If $length is out of range, substringData will return the data starting at $offset instead of causing an error.

appendData($string)
    $text->appendData( $somedata );
Appends a string to the end of the existing data.
If the current text node contains no data, this function has the same effect as setData.

insertData($offset,$string)
    $text->insertData($offset, $string);
Inserts the parameter $string at the given $offset of the existing data of the node. This operation will not remove existing data, but it changes the order of the existing data. The $offset has to be a positive value. If $offset is out of range, insertData behaves like appendData.

deleteData($offset, $length)
    $text->deleteData($offset, $length);
This method removes a chunk from the existing node data at the given offset. The $length parameter tells how many characters should be removed from the string.

deleteDataString($string, [$all])
    $text->deleteDataString($remstring, $all);
This method removes a chunk from the existing node data. Since the DOM spec is quite unhandy if you already know which string to remove from a text node, this method allows more perlish code :) The function takes two parameters: $string and, optionally, the $all flag. If $all is not set, undef or 0, deleteDataString will remove only the first occurrence of $string. If $all is TRUE, deleteDataString will remove all occurrences of $string from the node data.

replaceData($offset, $length, $string)
    $text->replaceData($offset, $length, $string);
The DOM style version to replace node data.

replaceDataString($oldstring, $newstring, [$all])
    $text->replaceDataString($old, $new, $flag);
The more programmer friendly version of replaceData() :) Instead of giving offsets and length one can specify the exact string ($oldstring) to be replaced. Additionally the $all flag allows one to replace all occurrences of $oldstring.

replaceDataRegEx( $search_cond, $replace_cond, $reflags )
    $text->replaceDataRegEx( $search_cond, $replace_cond, $reflags );
This method replaces the node's data by a simple regular expression. Optionally, this function allows one to pass some flags that will be added as flags to the replace statement.
NOTE: This is a shortcut for
    my $datastr = $node->getData();
    $datastr =~ s/somecond/replacement/g; # 'g' is just an example for any flag
    $node->setData( $datastr );
This function can make things easier to read for simple replacements. For more complex variants it is recommended to use the code snippet above.

XML::LibXML Comment Class
XML::LibXML::Comment

Synopsis
    use XML::LibXML;
    # Only methods specific to Comment nodes are listed here,
    # see the XML::LibXML::Node manpage for other methods

Description
This class provides all functions of XML::LibXML::Text, but for comment nodes. This can be done, since only the output of the node types is different, but not the data structure. :-)

Methods
The class inherits from XML::LibXML::Node. The documentation for inherited methods is not listed here. Many functions listed here are extensively documented in the DOM Level 3 specification. Please refer to the specification for extensive documentation.

new
    $node = XML::LibXML::Comment->new( $content );
The constructor is the only provided function for this package. It is required, because libxml2 treats text nodes and comment nodes slightly differently.

XML::LibXML Class for CDATA Sections
XML::LibXML::CDATASection

Synopsis
    use XML::LibXML;
    # Only methods specific to CDATA nodes are listed here,
    # see the XML::LibXML::Node manpage for other methods

Description
This class provides all functions of XML::LibXML::Text, but for CDATA nodes.

Methods
The class inherits from XML::LibXML::Node. The documentation for inherited methods is not listed here. Many functions listed here are extensively documented in the DOM Level 3 specification. Please refer to the specification for extensive documentation.

new
    $node = XML::LibXML::CDATASection->new( $content );
The constructor is the only provided function for this package. It is required, because libxml2 treats the different text node types slightly differently.
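Since comment and CDATA nodes share the text node methods, the same calls can be applied to all three node types. The following is a small sketch with made-up sample strings; it only uses methods documented in the text node section above.

```perl
use strict;
use warnings;
use XML::LibXML;

# A plain text node, edited with the character data methods.
my $text = XML::LibXML::Text->new('Hello world');
$text->appendData('!');                      # data is now "Hello world!"
$text->insertData(5, ',');                   # data is now "Hello, world!"
$text->replaceDataString('world', 'Perl');   # first occurrence only
my $data = $text->data;                      # same value as $text->nodeValue

# The same methods work on comment nodes ...
my $comment = XML::LibXML::Comment->new('draft');
$comment->appendData(' v2');

# ... and on CDATA sections.
my $cdata = XML::LibXML::CDATASection->new('a < b');
$cdata->setData('a <= b');

print $text->data, "\n";
print $comment->toString, "\n";
print $cdata->toString, "\n";
```

Only the serialization differs between the three node types; the stored data is handled identically.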
XML::LibXML Attribute Class
XML::LibXML::Attr

Synopsis
    use XML::LibXML;
    # Only methods specific to Attribute nodes are listed here,
    # see the XML::LibXML::Node manpage for other methods

Description
This is the interface to handle attributes like ordinary nodes. The naming of the class relies on the W3C DOM documentation.

Methods
The class inherits from XML::LibXML::Node. The documentation for inherited methods is not listed here. Many functions listed here are extensively documented in the DOM Level 3 specification. Please refer to the specification for extensive documentation.

new
    $attr = XML::LibXML::Attr->new($name [,$value]);
Class constructor. If you need to work with ISO encoded strings, you should always use the createAttribute method of XML::LibXML::Document.

getValue
    $string = $attr->getValue();
Returns the value stored for the attribute. If undef is returned, the attribute has no value, which is different from not being specified.

value
    $string = $attr->value;
Alias for getValue().

setValue
    $attr->setValue( $string );
Sets a new attribute value. If ISO encoded strings are passed as parameter, the node has to be bound to a document, otherwise the encoding might be done incorrectly.

getOwnerElement
    $node = $attr->getOwnerElement();
Returns the node the attribute belongs to. If the attribute is not bound to a node, undef will be returned. Overriding the underlying implementation, the parentNode function returns undef instead of the owner element.

setNamespace
    $attr->setNamespace($nsURI, $prefix);
This function tries to bind the attribute to a given namespace. If $nsURI is undefined or empty, the function discards any previous association of the attribute with a namespace. If the namespace was not previously declared in the context of the attribute, this function will fail. In this case you may wish to call setNamespace() on the ownerElement.
If the namespace URI is non-empty and declared in the context of the attribute, but only with a different (non-empty) prefix, then the attribute is still bound to the namespace but gets a different prefix than $prefix. The function also fails if the prefix is empty but the namespace URI is not (because unprefixed attributes should by definition belong to no namespace). This function returns 1 on success, 0 otherwise.

isId
    $bool = $attr->isId;
Determines whether an attribute is of type ID. For documents with a DTD, this information is only available if DTD loading/validation has been requested. For HTML documents parsed with the HTML parser, ID detection is done automatically. In XML documents, all "xml:id" attributes are considered to be of type ID.

serializeContent($docencoding)
    $string = $attr->serializeContent;
This function is not part of the DOM API. It returns attribute content in the form in which it serializes into XML, that is with all meta-characters properly quoted and with raw entity references (except for entities expanded during parse time). Setting the optional $docencoding flag to 1 enforces document encoding for the output string (which is then passed to Perl as a byte string). Otherwise the string is passed to Perl as (UTF-8 encoded) characters.

XML::LibXML's DOM L2 Document Fragment Implementation
XML::LibXML::DocumentFragment

Synopsis
    use XML::LibXML;

Description
This class is a helper class as described in the DOM Level 2 Specification. It is implemented as a node without a name. All adding, inserting or replacing functions are aware of document fragments now. In addition, all unbound nodes (all nodes that do not belong to any document sub-tree) are implicit members of document fragments.
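Because the adding functions are aware of document fragments, appending a fragment inserts its children rather than the fragment node itself. A minimal sketch, assuming a made-up <list> document and using parse_balanced_chunk() to obtain the fragment:

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;
my $doc    = $parser->parse_string('<list/>');
my $list   = $doc->documentElement;

# parse_balanced_chunk() returns an XML::LibXML::DocumentFragment
# holding the two <item> elements.
my $fragment = $parser->parse_balanced_chunk(
    '<item>one</item><item>two</item>'
);

# Appending the fragment moves its children into <list>.
$list->appendChild($fragment);

my @items = $list->getChildrenByTagName('item');
print $list->toString, "\n";
```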
XML::LibXML Namespace Implementation
XML::LibXML::Namespace

Synopsis
    use XML::LibXML;
    # Only methods specific to Namespace nodes are listed here,
    # see the XML::LibXML::Node manpage for other methods

Description
Namespace nodes are returned both by $element->findnodes('namespace::foo') and by $node->getNamespaces(). The namespace node API is not part of any current DOM API, and so it is quite minimal. It should be noted that namespace nodes are not a sub class of XML::LibXML::Node; however, namespace nodes act a lot like attribute nodes, and similarly named methods will return what you would expect if you treated the namespace node as an attribute. Note that in order to fix several inconsistencies between the API and the documentation, the behavior of some functions was changed in 1.64.

Methods

new
    my $ns = XML::LibXML::Namespace->new($nsURI);
Creates a new Namespace node. Note that this is not a 'node' in the sense of an attribute or an element node. Therefore you cannot call all node functions on it. All functions available for this node are listed below. Optionally you can pass the prefix to the namespace constructor. If this second parameter is omitted you will create a so-called default namespace. Note that the newly created namespace is not bound to any document or node, therefore you should not expect it to be available in an existing document.

declaredURI
Returns the URI for this namespace.

declaredPrefix
Returns the prefix for this namespace.

nodeName
    print $ns->nodeName();
Returns "xmlns:prefix", where prefix is the prefix for this namespace.

name
    print $ns->name();
Alias for nodeName().

getLocalName
    $localname = $ns->getLocalName();
Returns the local name of this node as if it were an attribute, that is, the prefix associated with the namespace.

getData
    print $ns->getData();
Returns the URI of the namespace, i.e. the value of this node as if it were an attribute.
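The accessors listed so far can be tried on an unbound namespace node. This is a small sketch; the URI http://example.org/ns and the prefix "ex" are placeholders.

```perl
use strict;
use warnings;
use XML::LibXML;

# Create an unbound namespace node with an explicit prefix.
my $ns = XML::LibXML::Namespace->new('http://example.org/ns', 'ex');

my $uri    = $ns->declaredURI;     # the namespace URI
my $prefix = $ns->declaredPrefix;  # the declared prefix
my $name   = $ns->nodeName;        # "xmlns:" followed by the prefix
my $local  = $ns->getLocalName;    # the prefix, as if it were an attribute
my $value  = $ns->getData;         # the URI, as if it were an attribute value

print "$name=\"$value\"\n";
```

Read together, nodeName and getData give the pieces of the corresponding xmlns declaration as it would appear on an element.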
getValue
    print $ns->getValue();
Alias for getData().

value
    print $ns->value();
Alias for getData().

getNamespaceURI
    $known_uri = $ns->getNamespaceURI();
Returns the string "http://www.w3.org/2000/xmlns/".

getPrefix
    $known_prefix = $ns->getPrefix();
Returns the string "xmlns".

unique_key
    $key = $ns->unique_key();
This method returns a key guaranteed to be unique for this namespace, and to always be the same value for this namespace. Two namespace objects return the same key if and only if they have the same prefix and the same URI. The returned key value is useful as a key in hashes.

XML::LibXML Processing Instructions
XML::LibXML::PI

Synopsis
    use XML::LibXML;
    # Only methods specific to Processing Instruction nodes are listed here,
    # see the XML::LibXML::Node manpage for other methods

Description
Processing instructions are implemented in XML::LibXML with read and write access. The PI data is the PI without the PI target (as specified in XML 1.0 [17]) as a string. This string can be accessed with getData as implemented in XML::LibXML::Node.

The write access takes into account that many processing instructions have attribute-like data. Therefore setData() provides, besides the interface conforming to the DOM spec, an interface to pass a set of named parameters. So the code segment
    my $pi = $dom->createProcessingInstruction("abc");
    $pi->setData(foo=>'bar', foobar=>'foobar');
    $dom->appendChild( $pi );
will result in the following PI in the DOM:
    <?abc foo="bar" foobar="foobar"?>
Which is how it is specified in the DOM specification. This three-step interface temporarily creates a node in Perl space. This can be avoided by using the insertProcessingInstruction() method. Instead of the three calls described above, the call
    $dom->insertProcessingInstruction("abc",'foo="bar" foobar="foobar"');
will have the same result as above.
XML::LibXML::PI's implementation of setData() documented below differs a bit from the standard version as available in XML::LibXML::Node:

setData
    $pinode->setData( $data_string );
    $pinode->setData( name=>string_value [...] );
This method allows one to change the content data of a PI. In addition to the interface specified for DOM Level 2, the method provides a named parameter interface to set the data. This parameter list is converted into a string before it is appended to the PI.

XML::LibXML DTD Handling
XML::LibXML::Dtd

Synopsis
    use XML::LibXML;

Description
This class holds a DTD. You may parse a DTD from either a string, or from an external SYSTEM identifier. No support is available as yet for parsing from a filehandle. XML::LibXML::Dtd is a sub-class of XML::LibXML::Node, so all the methods available to nodes (particularly toString()) are available to Dtd objects.

Methods

new
    $dtd = XML::LibXML::Dtd->new($public_id, $system_id);
Parses a DTD from the system identifier, and returns a DTD object that you can pass to $doc->is_valid() or $doc->validate().
    my $dtd = XML::LibXML::Dtd->new(
        "SOME // Public / ID / 1.0",
        "test.dtd"
    );
    my $doc = XML::LibXML->new->parse_file("test.xml");
    $doc->validate($dtd);

parse_string
    $dtd = XML::LibXML::Dtd->parse_string($dtd_str);
The same as new() above, except you can parse a DTD from a string. Note that parsing from a string may fail if the DTD contains external parametric-entity references with relative URLs.

getName
    $name = $dtd->getName();
Returns the name of the DTD; i.e., the name immediately following the DOCTYPE keyword.

publicId
    $publicId = $dtd->publicId();
Returns the public identifier of the external subset.

systemId
    $systemId = $dtd->systemId();
Returns the system identifier of the external subset.
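To illustrate how a DTD object is used for validation, the sketch below parses a DTD from a string and checks two documents against it. The <note> vocabulary is invented for the example; is_valid() and validate() are the document methods mentioned under new() above.

```perl
use strict;
use warnings;
use XML::LibXML;

# A small, made-up DTD parsed from a string.
my $dtd = XML::LibXML::Dtd->parse_string(<<'DTD');
<!ELEMENT note (to, body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT body (#PCDATA)>
DTD

my $parser = XML::LibXML->new;
my $good   = $parser->parse_string('<note><to>Ann</to><body>Hi</body></note>');
my $bad    = $parser->parse_string('<note><body>Hi</body></note>');

# is_valid() reports the result as a boolean.
my $good_ok = $good->is_valid($dtd);
my $bad_ok  = $bad->is_valid($dtd);

# validate() dies on an invalid document, so it is wrapped in eval.
my $err;
eval { $bad->validate($dtd); 1 } or $err = $@;

print "good document is valid\n" if $good_ok;
print "bad document rejected\n"  if !$bad_ok && defined $err;
```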
XML::LibXML Class for Input Callbacks
XML::LibXML::InputCallback

Synopsis
    use XML::LibXML;

    my $input_callbacks = XML::LibXML::InputCallback->new();
    $input_callbacks->register_callbacks([ $match_cb1, $open_cb1, $read_cb1, $close_cb1 ] );
    $input_callbacks->register_callbacks([ $match_cb2, $open_cb2, $read_cb2, $close_cb2 ] );
    $input_callbacks->register_callbacks( [ $match_cb3, $open_cb3, $read_cb3, $close_cb3 ] );
    $parser->input_callbacks( $input_callbacks );
    $parser->parse_file( $some_xml_file );

Description
You may get unexpected results when loading external documents during libxml2 parsing if the location of the resource is not an HTTP, FTP or relative location but, for example, an absolute path. To get around this limitation, you may add your own input handler to open, read and close particular types of locations or URI classes. Using these input callback handlers, you can, for example, handle your own custom URI schemes.

The input callbacks are used whenever XML::LibXML has to get something other than externally parsed entities from somewhere. They are implemented using a callback stack on the Perl layer in analogy to libxml2's native callback stack.

The XML::LibXML::InputCallback class transparently registers the input callbacks for libxml2's parser processes.

How does XML::LibXML::InputCallback work?
The libxml2 library offers a callback implementation as global functions only. To work around the troubles resulting from having only global callbacks (for example, if the same global callback stack is manipulated by different applications running together in a single Apache Web-server environment), XML::LibXML::InputCallback comes with an object-oriented and a function-oriented part.

Using the function-oriented part, the global callback stack of libxml2 can be manipulated. Those functions can be used as an interface to the callbacks on the C and XS layer.
In the object-oriented part, operations for working with the "pseudo-localized" callback stack are implemented. Currently, you can register and de-register callbacks on the Perl layer and initialize them on a per parser basis.

Callback Groups
The libxml2 input callbacks come in groups. One group contains a URI matcher (match), a data stream constructor (open), a data stream reader (read), and a data stream destructor (close). The callbacks can be manipulated on a per group basis only.

The Parser Process
The parser process works on an XML data stream, in which links to other resources can be embedded. These can be links to external DTDs or XIncludes, for example. Those resources are identified by URIs. The callback implementation of libxml2 assumes that one callback group can handle a certain set of URIs and a certain URI scheme. By default, callback handlers for file://*, file://*.gz, http://* and ftp://* are registered.

Callback groups in the callback stack are processed from top to bottom, meaning that callback groups registered later will be processed before the earlier registered ones.

While parsing the data stream, the libxml2 parser checks whether a registered callback group will handle a URI; if none will, the URI is interpreted as file://URI. To handle a URI, the match callback has to return '1'. If that happens, the handling of the URI is passed to that callback group. Next, the URI is passed to the open callback, which should return a reference to the data stream if it successfully opened the file, '0' otherwise. If opening the stream was successful, the read callback is called repeatedly until it returns an empty string. After the read callback, the close callback is called to close the stream.

Organisation of callback groups in XML::LibXML::InputCallback
Callback groups are implemented as a stack (array); each entry holds a reference to an array of the callbacks.
For the libxml2 library, the XML::LibXML::InputCallback callback implementation appears as one single callback group. The Perl implementation however allows one to manage different callback stacks on a per libxml2-parser basis.

Using XML::LibXML::InputCallback
After object instantiation using the parameter-less constructor, you can register callback groups.
    my $input_callbacks = XML::LibXML::InputCallback->new();
    $input_callbacks->register_callbacks([ $match_cb1, $open_cb1, $read_cb1, $close_cb1 ] );
    $input_callbacks->register_callbacks([ $match_cb2, $open_cb2, $read_cb2, $close_cb2 ] );
    $input_callbacks->register_callbacks( [ $match_cb3, $open_cb3, $read_cb3, $close_cb3 ] );
    $parser->input_callbacks( $input_callbacks );
    $parser->parse_file( $some_xml_file );

What about the old callback system prior to XML::LibXML::InputCallback?
In XML::LibXML versions prior to 1.59 (i.e. without the XML::LibXML::InputCallback module) you could define your callbacks either globally or locally. You can still do that using XML::LibXML::InputCallback, and in addition to that you can define the callbacks on a per parser basis!

If you use the old callback interface through global callbacks, XML::LibXML::InputCallback will treat them with a lower priority than the ones registered using the new interface. The global callbacks will not override the callback groups registered using the new interface. Local callbacks are attached to a specific parser instance, therefore they are treated with highest priority. If the match callback of the callback group registered as local variable is identical to one of the callback groups registered using the new interface, that callback group will be replaced.

Users of the old callback implementation whose open callback returned a plain string will have to adapt their code to return a reference to that string after upgrading to version >= 1.59. The new callback system can only deal with the open callback returning a reference!
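As an illustration of an open callback that returns a reference to a string, the following sketch serves an XInclude target from an in-memory hash. The "mem:" URI scheme, the %store hash and the sample documents are assumptions made for this example and are not part of XML::LibXML; the sketch also assumes that XInclude processing is enabled through the expand_xinclude parser option.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical in-memory resources, keyed by the part after "mem:".
my %store = ( greeting => '<greeting>hello</greeting>' );

my $icb = XML::LibXML::InputCallback->new();
$icb->register_callbacks([
    # match: claim every URI that uses the made-up "mem:" scheme
    sub { my $uri = shift; return $uri =~ /^mem:/ ? 1 : 0 },
    # open: return a REFERENCE to a copy of the stored string
    sub {
        my $uri = shift;
        (my $key = $uri) =~ s/^mem://;
        my $copy = $store{$key};
        return defined $copy ? \$copy : 0;
    },
    # read: hand out up to $length characters and remove them from the
    # buffer; an empty string signals the end of the data
    sub { my ($ref, $length) = @_; return substr($$ref, 0, $length, '') },
    # close: nothing to release for an in-memory string
    sub { return 1 },
]);

my $parser = XML::LibXML->new( expand_xinclude => 1 );
$parser->input_callbacks($icb);

my $doc = $parser->parse_string(
    '<root xmlns:xi="http://www.w3.org/2001/XInclude">'
  . '<xi:include href="mem:greeting"/></root>'
);
my $greeting = $doc->findvalue('/root/greeting');
print "$greeting\n";
```

The read callback consumes the referenced string as it goes, which is why the open callback hands out a copy rather than a reference into %store.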
Interface Description

Global Variables

$_CUR_CB
Stores the current callback and can be used as shortcut to access the callback stack.

@_GLOBAL_CALLBACKS
Stores all callback groups for the current parser process.

@_CB_STACK
Stores the currently used callback group. Used to prevent parser errors when dealing with nested XML data.

Global Callbacks

_callback_match
Implements the interface for the match callback at C-level and for the selection of the callback group from the callbacks defined at the Perl-level.

_callback_open
Forwards the open callback from libxml2 to the corresponding callback function at the Perl-level.

_callback_read
Forwards the read request to the corresponding callback function at the Perl-level and returns the result to libxml2.

_callback_close
Forwards the close callback from libxml2 to the corresponding callback function at the Perl-level.

Class methods

new()
A simple constructor.

register_callbacks( [ $match_cb, $open_cb, $read_cb, $close_cb ])
The four callbacks have to be given as array reference in the above order match, open, read, close!

unregister_callbacks( [ $match_cb, $open_cb, $read_cb, $close_cb ])
With no arguments given, unregister_callbacks() will delete the last registered callback group from the stack. If four callbacks are passed as array reference, the callback group to unregister will be identified by the match callback and deleted from the callback stack. Note that if several identical match callbacks are defined in different callback groups, ALL of them will be deleted from the stack.

init_callbacks( $parser )
Initializes the callback system for the provided parser before starting a parsing process.

cleanup_callbacks()
Resets global variables and the libxml2 callback stack.

lib_init_callbacks()
Used internally for callback registration at C-level.

lib_cleanup_callbacks()
Used internally for callback resetting at the C-level.
Example callbacks
The following example is purely fictitious and uses a MyScheme::Handler object that responds to methods similar to an IO::Handle.
    # Define the four callback functions
    sub match_uri {
        my $uri = shift;
        return $uri =~ /^myscheme:/; # trigger our callback group for 'myscheme' URIs
    }

    sub open_uri {
        my $uri = shift;
        my $handler = MyScheme::Handler->new($uri);
        return $handler;
    }

    # The returned $buffer will be parsed by the libxml2 parser
    sub read_uri {
        my $handler = shift;
        my $length = shift;
        my $buffer;
        read($handler, $buffer, $length);
        return $buffer; # $buffer will be an empty string '' if read() is done
    }

    # Close the handle associated with the resource.
    sub close_uri {
        my $handler = shift;
        close($handler);
    }

    # Register them with an instance of XML::LibXML::InputCallback
    my $input_callbacks = XML::LibXML::InputCallback->new();
    $input_callbacks->register_callbacks([ \&match_uri, \&open_uri,
                                           \&read_uri, \&close_uri ] );

    # Register the callback group at a parser instance
    $parser->input_callbacks( $input_callbacks );

    # $some_xml_file will be parsed using our callbacks
    $parser->parse_file( $some_xml_file );

RelaxNG Schema Validation
XML::LibXML::RelaxNG

Synopsis
    use XML::LibXML;
    $doc = XML::LibXML->new->parse_file($url);

Description
The XML::LibXML::RelaxNG class is a tiny frontend to libxml2's RelaxNG implementation. Currently it supports only schema parsing and document validation.

Methods

new
    $rngschema = XML::LibXML::RelaxNG->new( location => $filename_or_url, no_network => 1 );
    $rngschema = XML::LibXML::RelaxNG->new( string => $xmlschemastring, no_network => 1 );
    $rngschema = XML::LibXML::RelaxNG->new( DOM => $doc, no_network => 1 );
The constructor of XML::LibXML::RelaxNG needs to be called with a list of parameters. At least one of the location, string or DOM parameters is required to specify the source of the schema.
Setting the optional no_network parameter to 1 prevents the parser from accessing the network, and setting the optional recover parameter to 1 prevents the parser from calling die() on errors.

It is important that each schema has only a single source.

The location parameter allows one to parse a schema from the filesystem or a (non-HTTPS) URL. The string parameter will parse the schema from the given XML string. The DOM parameter allows one to parse the schema from a pre-parsed XML::LibXML::Document.

Note that the constructor will die() if the schema does not meet the constraints of the RelaxNG specification.

validate
    eval { $rngschema->validate( $doc ); };
This function allows one to validate a (parsed) document against the given RelaxNG schema. The argument of this function should be an XML::LibXML::Document object. If this function succeeds, it will return 0, otherwise it will die() and report the errors found. Because of this, validate() should always be called inside an eval block.

XML Schema Validation
XML::LibXML::Schema

Synopsis
    use XML::LibXML;
    $doc = XML::LibXML->new->parse_file($url);

Description
The XML::LibXML::Schema class is a tiny frontend to libxml2's XML Schema implementation. Currently it supports only schema parsing and document validation. As of 2.6.32, libxml2 only supports decimal types up to 24 digits (the standard requires at least 18).

Methods

new
    $xmlschema = XML::LibXML::Schema->new( location => $filename_or_url, no_network => 1 );
    $xmlschema = XML::LibXML::Schema->new( string => $xmlschemastring, no_network => 1 );
The constructor of XML::LibXML::Schema needs to be called with a list of parameters. At least one of the location or string parameters is required to specify the source of the schema. Setting the optional no_network parameter to 1 prevents the parser from accessing the network, and setting the optional recover parameter to 1 prevents the parser from calling die() on errors.

It is important that each schema has only a single source.

The location parameter allows one to parse a schema from the filesystem or a (non-HTTPS) URL.
The string parameter will parse the schema from the given XML string.

Note that the constructor will die() if the schema does not meet the constraints of the XML Schema specification.

validate

    eval { $xmlschema->validate( $doc ); };

This function allows one to validate a (parsed) document against the given XML Schema. The argument of this function should be an XML::LibXML::Document object. If this function succeeds, it will return 0, otherwise it will die() and report the errors found. Because of this, calls to validate() should always be wrapped in eval.

XPath Evaluation

XML::LibXML::XPathContext

Description

The XML::LibXML::XPathContext class provides an almost complete interface to libxml2's XPath implementation. With XML::LibXML::XPathContext, it is possible to evaluate XPath expressions in the context of an arbitrary node, context size, and context position, with a user-defined namespace-prefix mapping, custom XPath functions written in Perl, and even a custom XPath variable resolver.

Examples

Namespaces

This example demonstrates the registerNs() method. It finds all paragraph nodes in an XHTML document.

    my $xc = XML::LibXML::XPathContext->new($xhtml_doc);
    $xc->registerNs('xhtml', 'http://www.w3.org/1999/xhtml');
    my @nodes = $xc->findnodes('//xhtml:p');

Custom XPath functions

This example demonstrates the registerFunction() method by defining a function that filters nodes based on a Perl regular expression:

    sub grep_nodes {
        my ($nodelist,$regexp) = @_;
        my $result = XML::LibXML::NodeList->new;
        for my $node ($nodelist->get_nodelist()) {
            $result->push($node) if $node->textContent =~ $regexp;
        }
        return $result;
    };

    my $xc = XML::LibXML::XPathContext->new($node);
    $xc->registerFunction('grep_nodes', \&grep_nodes);
    my @nodes = $xc->findnodes('//section[grep_nodes(para,"\bsearch(ing|es)?\b")]');

Variables

This example demonstrates the registerVarLookupFunc() method.
We use XPath variables to recycle results of previous evaluations:

    # arguments arrive in the order described under registerVarLookupFunc below
    sub var_lookup {
        my ($data,$varname,$ns)=@_;
        return $data->{$varname};
    }

    my $areas = XML::LibXML->new->parse_file('areas.xml');
    my $empl = XML::LibXML->new->parse_file('employees.xml');

    my $xc = XML::LibXML::XPathContext->new($empl);

    my %variables = (
        A => $xc->find('/employees/employee[@salary>10000]'),
        B => $areas->find('/areas/area[district="Brooklyn"]/street'),
    );

    # get names of employees from $A working in an area listed in $B
    $xc->registerVarLookupFunc(\&var_lookup, \%variables);
    my @nodes = $xc->findnodes('$A[work_area/street = $B]/name');

Methods

new

    my $xpc = XML::LibXML::XPathContext->new();

Creates a new XML::LibXML::XPathContext object without a context node.

    my $xpc = XML::LibXML::XPathContext->new($node);

Creates a new XML::LibXML::XPathContext object with the context node set to $node.

registerNs

    $xpc->registerNs($prefix, $namespace_uri)

Registers namespace $prefix to $namespace_uri.

unregisterNs

    $xpc->unregisterNs($prefix)

Unregisters namespace $prefix.

lookupNs

    $uri = $xpc->lookupNs($prefix)

Returns the namespace URI registered with $prefix. If $prefix is not registered to any namespace URI, returns undef.

registerVarLookupFunc

    $xpc->registerVarLookupFunc($callback, $data)

Registers the variable lookup function $callback. The registered function is executed by the XPath engine each time an XPath variable is evaluated. It takes three arguments: $data, variable name, and variable ns-URI, and must return one value: a number or string or any XML::LibXML::* object that can be a result of findnodes: Boolean, Literal, Number, Node (e.g. Document, Element, etc.), or NodeList. For convenience, simple (non-blessed) array references containing only objects can be used instead of an XML::LibXML::NodeList.

getVarLookupData

    $data = $xpc->getVarLookupData();

Returns the data that was associated with a variable lookup function during a previous call to registerVarLookupFunc.
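As a smaller, self-contained illustration of registerVarLookupFunc, the sketch below resolves an XPath variable from a plain hash. The inline document, the variable name min, and the helper name lookup_var are made up for this example. The callback unpacks its arguments in the order given in the method description ($data, variable name, namespace URI).

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical sample data, inlined so the sketch runs on its own.
my $doc = XML::LibXML->load_xml(string => <<'XML');
<employees>
  <employee salary="12000"><name>Ann</name></employee>
  <employee salary="8000"><name>Bob</name></employee>
</employees>
XML

# Lookup callback: receives the registered data, the variable name,
# and the variable's namespace URI, in that order.
sub lookup_var {
    my ($data, $name, $ns_uri) = @_;
    return $data->{$name};
}

my %variables = ( min => 10000 );
my $xc = XML::LibXML::XPathContext->new($doc);
$xc->registerVarLookupFunc(\&lookup_var, \%variables);

# $min is resolved through lookup_var() while the expression is evaluated.
my @names = map { $_->textContent }
            $xc->findnodes('/employees/employee[@salary > $min]/name');
print "@names\n";

# The data registered with the callback can be read back later.
my $data = $xc->getVarLookupData();
```

Keeping the variables in a hash that is passed as $data avoids global state: the same callback can serve several contexts, each with its own set of values.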
getVarLookupFunc

    $callback = $xpc->getVarLookupFunc();

Returns the variable lookup function previously registered with registerVarLookupFunc.

unregisterVarLookupFunc

    $xpc->unregisterVarLookupFunc($name);

Unregisters the variable lookup function and the associated lookup data.

registerFunctionNS

    $xpc->registerFunctionNS($name, $uri, $callback)

Registers an extension function $name in the $uri namespace. $callback must be a CODE reference. The arguments of the callback function are either simple scalars or XML::LibXML::* objects, depending on the XPath argument types. The function is responsible for checking the number and types of its arguments. The result of the callback code must be a single value of one of the following types: a simple scalar (number, string) or an arbitrary XML::LibXML::* object that can be a result of findnodes: Boolean, Literal, Number, Node (e.g. Document, Element, etc.), or NodeList. For convenience, simple (non-blessed) array references containing only objects can be used instead of an XML::LibXML::NodeList.

unregisterFunctionNS

    $xpc->unregisterFunctionNS($name, $uri)

Unregisters the extension function $name in the $uri namespace. Has the same effect as passing undef as $callback to registerFunctionNS.

registerFunction

    $xpc->registerFunction($name, $callback)

Same as registerFunctionNS but without a namespace.

unregisterFunction

    $xpc->unregisterFunction($name)

Same as unregisterFunctionNS but without a namespace.

findnodes

    @nodes = $xpc->findnodes($xpath)
    @nodes = $xpc->findnodes($xpath, $context_node )
    $nodelist = $xpc->findnodes($xpath, $context_node )

Evaluates the XPath expression on the current context node and returns the result as a list. In scalar context, returns an XML::LibXML::NodeList object. Optionally, a node may be passed as a second argument to set the context node for the query.

The XPath expression can be passed either as a string, or as an XML::LibXML::XPathExpression object.
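The calling conventions of findnodes can be sketched as follows. The inline document and all variable names are assumptions made for illustration; the sketch shows list context, scalar context, an explicit context node, and a pre-compiled expression.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical sample document.
my $doc = XML::LibXML->load_xml(string =>
    '<doc><section><para>one</para><para>two</para></section>'
  . '<section><para>three</para></section></doc>');

my $xpc = XML::LibXML::XPathContext->new($doc);

# List context: a plain Perl list of nodes.
my @all = $xpc->findnodes('//para');

# Scalar context: an XML::LibXML::NodeList object.
my $list = $xpc->findnodes('//para');
my $size = $list->size;

# A second argument sets the context node for this query only.
my ($second) = $xpc->findnodes('/doc/section[2]');
my @in_second = map { $_->textContent } $xpc->findnodes('para', $second);

# A pre-compiled expression can be passed instead of a string.
my $compiled = XML::LibXML::XPathExpression->new('para[1]');
my ($first_section) = $xpc->findnodes('/doc/section[1]');
my ($first_para) = $xpc->findnodes($compiled, $first_section);

print scalar(@all), " $size @in_second ", $first_para->textContent, "\n";
```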
find

    $object = $xpc->find($xpath )
    $object = $xpc->find($xpath, $context_node )

Evaluates the XPath expression using the current node as the context of the expression, and returns a result whose type depends on the type of the XPath result. For example, the XPath 1 * 3 + 52 results in an XML::LibXML::Number object being returned. Other expressions might return an XML::LibXML::Boolean object, or an XML::LibXML::Literal object (a string). Each of those objects uses Perl's overload feature to "do the right thing" in different contexts. Optionally, a node may be passed as a second argument to set the context node for the query.

The XPath expression can be passed either as a string, or as an XML::LibXML::XPathExpression object.

findvalue

    $value = $xpc->findvalue($xpath )
    $value = $xpc->findvalue($xpath, $context_node )

Is exactly equivalent to:

    $xpc->find( $xpath, $context_node )->to_literal;

That is, it returns the literal value of the results. This ensures that you get a string back from your search, allowing certain shortcuts. This could be used as the equivalent of <xsl:value-of select="some_xpath"/>. Optionally, a node may be passed as the second argument to set the context node for the query.

The XPath expression can be passed either as a string, or as an XML::LibXML::XPathExpression object.

exists

    $bool = $xpc->exists( $xpath_expression, $context_node );

This method behaves like findnodes, except that it only returns a boolean value (1 if the expression matches a node, 0 otherwise) and may be faster than findnodes, because the XPath evaluation may stop early on the first match (this is true for libxml2 >= 2.6.27).

For XPath expressions that do not return a node-set, the method returns true if the returned value is a non-zero number or a non-empty string.

setContextNode

    $xpc->setContextNode($node)

Set the current context node.

getContextNode

    my $node = $xpc->getContextNode;

Get the current context node.
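The difference between find, findvalue and exists, and the effect of setContextNode, can be seen in a short sketch. The inline document and the variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical sample document.
my $doc = XML::LibXML->load_xml(string =>
    '<doc><section><para>one</para><para>two</para></section>'
  . '<section><para>three</para></section></doc>');

my $xpc = XML::LibXML::XPathContext->new($doc);

# find() returns an object whose class depends on the XPath result type.
my $count = $xpc->find('count(//para)');       # XML::LibXML::Number
my $class = ref $count;
my $n     = $count->value;

# findvalue() always yields the string value of the result.
my $text = $xpc->findvalue('/doc/section[1]/para[2]');

# exists() only reports whether the expression matches anything.
my $has_three = $xpc->exists('//para[. = "three"]');
my $has_four  = $xpc->exists('//para[. = "four"]');

# Relative expressions are evaluated against the current context node.
my ($section) = $xpc->findnodes('/doc/section[2]');
$xpc->setContextNode($section);
my $relative = $xpc->findvalue('para');
my $same = $xpc->getContextNode->isSameNode($section);

print "$class $n $text $has_three $has_four $relative\n";
```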
setContextPosition

    $xpc->setContextPosition($position)

Set the current context position. By default, this value is -1 (and evaluating the XPath function position() in the initial context raises an XPath error), but it can be set to any value up to the context size. This usually only serves to trick the XPath engine into returning the given position when the position() XPath function is called. Setting this value to -1 restores the default behavior.

getContextPosition

    my $position = $xpc->getContextPosition;

Get the current context position.

setContextSize

    $xpc->setContextSize($size)

Set the current context size. By default, this value is -1 (and evaluating the XPath function last() in the initial context raises an XPath error), but it can be set to any non-negative value. This usually only serves to trick the XPath engine into returning the given value when the last() XPath function is called. If the context size is set to 0, the position is automatically also set to 0. If the context size is positive, the position is automatically set to 1. Setting the context size to -1 restores the default behavior.

getContextSize

    my $size = $xpc->getContextSize;

Get the current context size.

Bugs And Caveats

XML::LibXML::XPathContext objects are reentrant, meaning that you can call methods of an XML::LibXML::XPathContext even from XPath extension functions registered with the same object or from a variable lookup function. On the other hand, you should avoid registering new extension functions, namespaces and a variable lookup function from within extension functions and a variable lookup function, unless you want to experience untested behavior.

Authors

Ilya Martynov and Petr Pajas, based on XML::LibXML and XML::LibXSLT code by Matt Sergeant and Christian Glahn.

Historical remark

Prior to XML::LibXML 1.61 this module was distributed separately for maintenance reasons.
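The interplay between context size and context position described for setContextSize and setContextPosition can be sketched as follows. The one-element document and the chosen numbers are arbitrary assumptions for the example.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Any document will do; the context node is only needed for evaluation.
my $doc = XML::LibXML->load_xml(string => '<doc/>');
my $xpc = XML::LibXML::XPathContext->new($doc);

# A positive context size implicitly resets the position to 1.
$xpc->setContextSize(5);
my $initial_position = $xpc->getContextPosition;

# The position may then be set to any value up to the context size.
$xpc->setContextPosition(3);

# position() and last() now report the values set above.
my $position = $xpc->findvalue('position()');
my $last     = $xpc->findvalue('last()');

# Setting the size back to -1 restores the default behavior.
$xpc->setContextSize(-1);
my $restored_size = $xpc->getContextSize;

print "$initial_position $position $last $restored_size\n";
```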
XML::LibXML::Reader - interface to libxml2 pull parser

XML::LibXML::Reader

Synopsis

    use XML::LibXML::Reader;

    my $reader = XML::LibXML::Reader->new(location => "file.xml")
        or die "cannot read file.xml\n";
    while ($reader->read) {
        processNode($reader);
    }

    sub processNode {
        my $reader = shift;
        printf "%d %d %s %d\n", ($reader->depth, $reader->nodeType, $reader->name, $reader->isEmptyElement);
    }

or

    my $reader = XML::LibXML::Reader->new(location => "file.xml")
        or die "cannot read file.xml\n";
    $reader->preservePattern('//table/tr');
    $reader->finish;
    print $reader->document->toString(1);

DESCRIPTION

This is a Perl interface to libxml2's pull-parser implementation xmlTextReader http://xmlsoft.org/html/libxml-xmlreader.html. This feature requires at least libxml2-2.6.21. Pull-parsers (such as StAX in Java, or XmlReader in C#) use an iterator approach to parse XML documents. They are easier to program than event-based parsers (SAX) and much more lightweight than tree-based parsers (DOM), which load the complete tree into memory.

The Reader acts as a cursor going forward on the document stream and stopping at each node on the way. At every point, the DOM-like methods of the Reader object allow one to examine the current node (name, namespace, attributes, etc.).

The user's code keeps control of the progress and simply calls the read() function repeatedly to progress to the next node in the document order. Other functions provide means for skipping complete sub-trees, or nodes until a specific element, etc.

At any time, only a very limited portion of the document is kept in memory, which makes the API more memory-efficient than using DOM. However, it is also possible to mix Reader with DOM. At every point the user may copy the current node (optionally expanded into a complete sub-tree) from the processed document to another DOM tree, or instruct the Reader to collect a sub-document in the form of a DOM tree consisting of selected nodes.
The Reader API also supports namespaces, xml:base, entity handling, and DTD validation. Schema and RelaxNG validation support will probably be added in some later revision of the Perl interface.

The naming of methods compared to libxml2 and C# XmlTextReader has been changed slightly to match the conventions of XML::LibXML. Some functions have been changed or added with respect to the C interface.

CONSTRUCTOR

Depending on the XML source, the Reader object can be created with any of:

    my $reader = XML::LibXML::Reader->new( location => "file.xml", ... );
    my $reader = XML::LibXML::Reader->new( string => $xml_string, ... );
    my $reader = XML::LibXML::Reader->new( IO => $file_handle, ... );
    my $reader = XML::LibXML::Reader->new( FD => fileno(STDIN), ... );
    my $reader = XML::LibXML::Reader->new( DOM => $dom, ... );

where ... are (optional) reader options described below in Reader options or various parser options described in XML::LibXML::Parser. The constructor recognizes the following XML sources:

Source specification

location
    Read XML from a local file or (non-HTTPS) URL.

string
    Read XML from a string.

IO
    Read XML from a Perl IO filehandle.

FD
    Read XML from a file descriptor (bypasses Perl I/O layer, only applicable to filehandles for regular files or pipes). Possibly faster than IO.

DOM
    Use the reader API to walk through a pre-parsed XML::LibXML::Document.

Reader options

encoding => $encoding
    Override the document encoding.

RelaxNG => $rng_schema
    Can be used to pass either an XML::LibXML::RelaxNG object or a filename or (non-HTTPS) URL of a RelaxNG schema to the constructor. The schema is then used to validate the document as it is processed.

Schema => $xsd_schema
    Can be used to pass either an XML::LibXML::Schema object or a filename or (non-HTTPS) URL of a W3C XSD schema to the constructor. The schema is then used to validate the document as it is processed.

...
    The reader further supports various parser options described in XML::LibXML::Parser (specifically those labeled by /reader/).
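A minimal sketch of the string source together with the basic read loop is shown below. The inline XML and the variable names are assumptions for the example; the node-type constant is one of those exported by XML::LibXML::Reader.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical input, passed through the "string" source.
my $xml = '<table><tr><td>1</td></tr><tr><td>2</td></tr></table>';

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader\n";

# Walk the stream node by node and record depth and name of each start tag.
my @seen;
while ($reader->read == 1) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    push @seen, $reader->depth . ':' . $reader->name;
}

print "@seen\n";
```

Because the reader only moves forward, the loop never holds more than the current node, which is what keeps memory use low on large inputs.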
METHODS CONTROLLING PARSING PROGRESS

read ()
    Moves the position to the next node in the stream, exposing its properties. Returns 1 if the node was read successfully, 0 if there are no more nodes to read, or -1 in case of error.

readAttributeValue ()
    Parses an attribute value into one or more Text and EntityReference nodes. Returns 1 in case of success, 0 if the reader was not positioned on an attribute node or all the attribute values have been read, or -1 in case of error.

readState ()
    Gets the read state of the reader. Returns the state value, or -1 in case of error. The module exports constants for the Reader states, see STATES below.

depth ()
    The depth of the node in the tree, starting at 0 for the root node.

next ()
    Skip to the node following the current one in the document order while avoiding the sub-tree if any. Returns 1 if the node was read successfully, 0 if there are no more nodes to read, or -1 in case of error.

nextElement (localname?,nsURI?)
    Skip nodes following the current one in the document order until a specific element is reached. The element's name must be equal to the given localname if defined, and its namespace must be equal to the given nsURI if defined. Either of the arguments can be undefined (or omitted, in case of the latter or both). Returns 1 if the element was found, 0 if there are no more nodes to read, or -1 in case of error.

nextPatternMatch (compiled_pattern)
    Skip nodes following the current one in the document order until an element matching the given compiled pattern is reached. See XML::LibXML::Pattern for information on compiled patterns. See also the matchesPattern method. Returns 1 if the element was found, 0 if there are no more nodes to read, or -1 in case of error.

skipSiblings ()
    Skip all nodes on the same or lower level until the first node on a higher level is reached. In particular, if the current node occurs in an element, the reader stops at the end tag of the parent element, otherwise it stops at a node immediately following the parent node.
    Returns 1 if successful, 0 if the end of the document is reached, or -1 in case of error.

nextSibling ()
    Skips to the node following the current one in the document order while avoiding the sub-tree if any. Returns 1 if the node was read successfully, 0 if there are no more nodes to read, or -1 in case of error.

nextSiblingElement (name?,nsURI?)
    Like nextElement but only processes sibling elements of the current node (moving forward using nextSibling () rather than read (), internally). Returns 1 if the element was found, 0 if there are no more sibling nodes, or -1 in case of error.

finish ()
    Skip all remaining nodes in the document, reaching the end of the document. Returns 1 if successful, 0 in case of error.

close ()
    This method releases any resources allocated by the current instance and closes any underlying input. It returns 0 on failure and 1 on success. This method is automatically called by the destructor when the reader is forgotten, therefore you do not have to call it directly.

METHODS EXTRACTING INFORMATION

name ()
    Returns the qualified name of the current node, equal to (Prefix:)LocalName.

nodeType ()
    Returns the type of the current node. See NODE TYPES below.

localName ()
    Returns the local name of the node.

prefix ()
    Returns the prefix of the namespace associated with the node.

namespaceURI ()
    Returns the URI defining the namespace associated with the node.

isEmptyElement ()
    Check if the current node is empty. This is a bit bizarre in the sense that <a/> will be considered empty while <a></a> will not.

hasValue ()
    Returns true if the node can have a text value.

value ()
    Provides the text value of the node if present, or undef if not available.

readInnerXml ()
    Reads the contents of the current node, including child nodes and markup. Returns a string containing the XML of the node's content, or undef if the current node is neither an element nor an attribute, or has no child nodes.
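The following sketch combines nextElement with some of the information methods above, including the <a/> versus <a></a> distinction made by isEmptyElement. The inline XML and variable names are assumptions for illustration, and readInnerXml is assumed to be available in the installed libxml2 build.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical input: one self-closing <a/>, one <a></a>, and a <b> with mixed content.
my $xml = '<r><a/><a></a><b>hi <i>x</i></b></r>';
my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader\n";

# First <a/>: written as an empty-element tag.
$reader->nextElement('a') == 1 or die "first <a> not found\n";
my $first_empty = $reader->isEmptyElement;

# Second <a></a>: start tag followed by an end tag.
$reader->nextElement('a') == 1 or die "second <a> not found\n";
my $second_empty = $reader->isEmptyElement;

# <b>: report its name and serialize its content.
$reader->nextElement('b') == 1 or die "<b> not found\n";
my $name  = $reader->name;
my $inner = $reader->readInnerXml;

print "$first_empty $second_empty $name $inner\n";
```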
readOuterXml ()
    Reads the contents of the current node, including child nodes and markup. Returns a string containing the XML of the node including its content, or undef if the current node is neither an element nor an attribute.

nodePath()
    Returns a canonical location path from the root node to the current node. Namespaced elements are matched by '*', because there is no way to declare prefixes within XPath patterns. Unlike XML::LibXML::Node::nodePath(), this function does not provide sibling counts (i.e. instead of e.g. '/a/b[1]' and '/a/b[2]' you get '/a/b' for both matches).

matchesPattern(compiled_pattern)
    Returns a true value if the current node matches a compiled pattern. See XML::LibXML::Pattern for information on compiled patterns. See also the nextPatternMatch method.

METHODS EXTRACTING DOM NODES

document ()
    Provides access to the document tree built by the reader. This function can be used to collect the preserved nodes (see preserveNode() and preservePattern).

    CAUTION: Never use this function to modify the tree unless reading of the whole document is completed!

copyCurrentNode (deep)
    This function is similar to the DOM function copyNode(). It returns a copy of the currently processed node as a corresponding DOM object. Use deep = 1 to obtain the full sub-tree.

preserveNode ()
    This tells the XML Reader to preserve the current node in the document tree. A document tree consisting of the preserved nodes and their content can be obtained using the method document() once parsing is finished.

    Returns the node or NULL in case of error.

preservePattern (pattern,\%ns_map)
    This tells the XML Reader to preserve all nodes matched by the pattern (which is a streaming XPath subset). A document tree consisting of the preserved nodes and their content can be obtained using the method document() once parsing is finished.

    An optional second argument can be used to provide a HASH reference mapping prefixes used by the XPath to namespace URIs.
    The XPath subset available with this function is described at http://www.w3.org/TR/xmlschema-1/#Selector and matches the production

        Path ::= ('.//')? ( Step '/' )* ( Step | '@' NameTest )

    Returns a positive number in case of success and -1 in case of error.

METHODS PROCESSING ATTRIBUTES

attributeCount ()
    Provides the number of attributes of the current node.

hasAttributes ()
    Whether the node has attributes.

getAttribute (name)
    Provides the value of the attribute with the specified qualified name. Returns a string containing the value of the specified attribute, or undef in case of error.

getAttributeNs (localName, namespaceURI)
    Provides the value of the specified attribute. Returns a string containing the value of the specified attribute, or undef in case of error.

getAttributeNo (no)
    Provides the value of the attribute with the specified index relative to the containing element. Returns a string containing the value of the specified attribute, or undef in case of error.

isDefault ()
    Returns true if the current attribute node was generated from the default value defined in the DTD.

moveToAttribute (name)
    Moves the position to the attribute with the specified qualified name. Returns 1 in case of success, -1 in case of error, 0 if not found.

moveToAttributeNo (no)
    Moves the position to the attribute with the specified index relative to the containing element. Returns 1 in case of success, -1 in case of error, 0 if not found.

moveToAttributeNs (localName,namespaceURI)
    Moves the position to the attribute with the specified local name and namespace URI. Returns 1 in case of success, -1 in case of error, 0 if not found.

moveToFirstAttribute ()
    Moves the position to the first attribute associated with the current node. Returns 1 in case of success, -1 in case of error, 0 if not found.

moveToNextAttribute ()
    Moves the position to the next attribute associated with the current node.
    Returns 1 in case of success, -1 in case of error, 0 if not found.

moveToElement ()
    Moves the position to the node that contains the current attribute node. Returns 1 in case of success, -1 in case of error, 0 if not moved.

isNamespaceDecl ()
    Determine whether the current node is a namespace declaration rather than a regular attribute. Returns 1 if the current node is a namespace declaration, 0 if it is a regular attribute or other type of node, or -1 in case of error.

OTHER METHODS

lookupNamespace (prefix)
    Resolves a namespace prefix in the scope of the current element. Returns a string containing the namespace URI to which the prefix maps, or undef in case of error.

encoding ()
    Returns a string containing the encoding of the document, or undef in case of error.

standalone ()
    Determine the standalone status of the document being read. Returns 1 if the document was declared to be standalone, 0 if it was declared to be not standalone, or -1 if the document did not specify its standalone status or in case of error.

xmlVersion ()
    Determine the XML version of the document being read. Returns a string containing the XML version of the document, or undef in case of error.

baseURI ()
    Returns the base URI of a given node.

isValid ()
    Retrieve the validity status from the parser. Returns 1 if valid, 0 if not, and -1 in case of error.

xmlLang ()
    The xml:lang scope within which the node resides.

lineNumber ()
    Provides the line number of the current parsing point.

columnNumber ()
    Provides the column number of the current parsing point.

byteConsumed ()
    This function provides the current index of the parser relative to the start of the current entity. The index is computed in bytes from the beginning, starting at zero and finishing at the size in bytes of the file if parsing a file. The function is of constant cost if the input is UTF-8 but can be costly if run on non-UTF-8 input.

setParserProp (prop => value, ...)
    Change the parser processing behaviour by changing some of its internal properties. The following properties are available with this function: "load_ext_dtd", "complete_attributes", "validation", "expand_entities".

    Since some of the properties can only be changed before any read has been done, it is best to set the parsing properties at the constructor.

    Returns 0 if the call was successful, or -1 in case of error.

getParserProp (prop)
    Get the value of an internal parser property. The following property names can be used: "load_ext_dtd", "complete_attributes", "validation", "expand_entities".

    Returns the value, usually 0 or 1, or -1 in case of error.

DESTRUCTION

XML::LibXML takes care of the reader object destruction when the last reference to the reader object goes out of scope. The document tree is preserved, though, if either of $reader->document or $reader->preserveNode was used and references to the document tree exist.

NODE TYPES

The reader interface provides the following constants for node types (the constant symbols are exported by default or if tag :types is used).

    XML_READER_TYPE_NONE                    => 0
    XML_READER_TYPE_ELEMENT                 => 1
    XML_READER_TYPE_ATTRIBUTE               => 2
    XML_READER_TYPE_TEXT                    => 3
    XML_READER_TYPE_CDATA                   => 4
    XML_READER_TYPE_ENTITY_REFERENCE        => 5
    XML_READER_TYPE_ENTITY                  => 6
    XML_READER_TYPE_PROCESSING_INSTRUCTION  => 7
    XML_READER_TYPE_COMMENT                 => 8
    XML_READER_TYPE_DOCUMENT                => 9
    XML_READER_TYPE_DOCUMENT_TYPE           => 10
    XML_READER_TYPE_DOCUMENT_FRAGMENT       => 11
    XML_READER_TYPE_NOTATION                => 12
    XML_READER_TYPE_WHITESPACE              => 13
    XML_READER_TYPE_SIGNIFICANT_WHITESPACE  => 14
    XML_READER_TYPE_END_ELEMENT             => 15
    XML_READER_TYPE_END_ENTITY              => 16
    XML_READER_TYPE_XML_DECLARATION         => 17

STATES

The following constants represent the values returned by readState().
They are exported by default, or if tag :states is used:

    XML_READER_NONE      => -1
    XML_READER_START     =>  0
    XML_READER_ELEMENT   =>  1
    XML_READER_END       =>  2
    XML_READER_EMPTY     =>  3
    XML_READER_BACKTRACK =>  4
    XML_READER_DONE      =>  5
    XML_READER_ERROR     =>  6

SEE ALSO

XML::LibXML::Pattern for information about compiled patterns.

http://xmlsoft.org/html/libxml-xmlreader.html

http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html

ORIGINAL IMPLEMENTATION

Heiko Klein, <H.Klein@gmx.net> and Petr Pajas

XML::LibXML::XPathExpression - interface to libxml2 pre-compiled XPath expressions

XML::LibXML::XPathExpression

Synopsis

    use XML::LibXML;
    my $compiled_xpath = XML::LibXML::XPathExpression->new('//foo[@bar="baz"][position()<4]');

    # interface from XML::LibXML::Node
    my $result = $node->find($compiled_xpath);
    my @nodes = $node->findnodes($compiled_xpath);
    my $value = $node->findvalue($compiled_xpath);

    # interface from XML::LibXML::XPathContext
    my $result = $xpc->find($compiled_xpath,$node);
    my @nodes = $xpc->findnodes($compiled_xpath,$node);
    my $value = $xpc->findvalue($compiled_xpath,$node);

Description

This is a Perl interface to libxml2's pre-compiled XPath expressions. Pre-compiling an XPath expression can give some performance benefit if the same XPath query is evaluated many times. XML::LibXML::XPathExpression objects can be passed to all find... functions in XML::LibXML that expect an XPath expression.

new()

    $compiled = XML::LibXML::XPathExpression->new( xpath_string );

The constructor takes an XPath 1.0 expression as a string and returns an object representing the pre-compiled expression (the actual data structure is internal to libxml2).

XML::LibXML::Pattern - interface to libxml2 XPath patterns

XML::LibXML::Pattern

Synopsis

    use XML::LibXML;
    my $pattern = XML::LibXML::Pattern->new('/x:html/x:body//x:div', { 'x' => 'http://www.w3.org/1999/xhtml' });

    # test a match on an XML::LibXML::Node $node
    if ($pattern->matchesNode($node)) { ...
    }

    # or on an XML::LibXML::Reader
    if ($reader->matchesPattern($pattern)) { ... }

    # or skip reading all nodes that do not match
    print $reader->nodePath while $reader->nextPatternMatch($pattern);

Description

This is a Perl interface to libxml2's pattern matching support http://xmlsoft.org/html/libxml-pattern.html. This feature requires recent versions of libxml2.

Patterns are a small subset of the XPath language, limited to (disjunctions of) location paths involving the child and descendant axes in abbreviated form, as described by the extended BNF given below:

    Selector ::= Path ( '|' Path )*
    Path     ::= ('.//' | '//' | '/' )? Step ( '/' Step )*
    Step     ::= '.' | NameTest
    NameTest ::= QName | '*' | NCName ':' '*'

For readability, whitespace may be used in selector XPath expressions even though not explicitly allowed by the grammar: whitespace may be freely added within patterns before or after any token, where

    token ::= '.' | '/' | '//' | '|' | NameTest

Note that no predicates or attribute tests are allowed.

Patterns are particularly useful for stream parsing provided via the XML::LibXML::Reader interface.

new()

    $pattern = XML::LibXML::Pattern->new( pattern, { prefix => namespace_URI, ... } );

The constructor of a pattern takes a pattern expression (as described by the BNF grammar above) and an optional HASH reference mapping prefixes to namespace URIs. The method returns a compiled pattern object.

Note that if the document has a default namespace, it must still be given a prefix in order to be matched (as demanded by the XPath 1.0 specification). For example, to match an element <a xmlns="http://foo.bar"></a>, one should use a pattern like this:

    $pattern = XML::LibXML::Pattern->new( 'foo:a', { foo => 'http://foo.bar' });

matchesNode($node)

    $bool = $pattern->matchesNode($node);

Given an XML::LibXML::Node object, returns a true value if the node is matched by the compiled pattern expression.

SEE ALSO

XML::LibXML::Reader for other methods involving compiled patterns.
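The note about default namespaces can be checked with a short sketch. The document below is an assumption built around the http://foo.bar namespace used in the text; the variable names are illustrative.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document whose root element lives in a default namespace.
my $doc  = XML::LibXML->load_xml(string => '<a xmlns="http://foo.bar"><b/></a>');
my $root = $doc->documentElement;

# The pattern must bind a prefix to the namespace, even though the
# document itself uses no prefix.
my $prefixed = XML::LibXML::Pattern->new('foo:a', { foo => 'http://foo.bar' });

# Without a prefix, the pattern refers to an element "a" in no namespace.
my $plain = XML::LibXML::Pattern->new('a');

my $matches_prefixed = $prefixed->matchesNode($root) ? 1 : 0;
my $matches_plain    = $plain->matchesNode($root)    ? 1 : 0;

print "$matches_prefixed $matches_plain\n";
```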
XML::LibXML::RegExp - interface to libxml2 regular expressions

XML::LibXML::RegExp

Synopsis

    use XML::LibXML;
    my $compiled_re = XML::LibXML::RegExp->new('[0-9]{5}(-[0-9]{4})?');
    if ($compiled_re->isDeterministic()) { ... }
    if ($compiled_re->matches($string)) { ... }

Description

This is a Perl interface to libxml2's implementation of regular expressions, which are used e.g. for validation of XML Schema simple types (pattern facet).

new()

    $compiled_re = XML::LibXML::RegExp->new( $regexp_str );

The constructor takes a string containing a regular expression and returns a compiled regexp object.

matches($string)

    $bool = $compiled_re->matches($string);

Given a string value, returns a true value if the value is matched by the compiled regular expression.

isDeterministic()

    $bool = $compiled_re->isDeterministic();

Returns a true value if the regular expression is deterministic; returns false otherwise. (See the definition of determinism in the XML spec.)

A map for named nodes

XML::LibXML::NamedNodeMap

Synopsis

    use XML::LibXML;
    my $map = XML::LibXML::NamedNodeMap->new(@nodes);
    my $nodes_list = $map->nodes();
    my $node_with_index_2 = $map->item(2);

Description

XML::LibXML::NamedNodeMap maps nodes' names to nodes.

Methods

length

    my $length = $map->length;

Returns the number of nodes in the map.

nodes

    my $nodes_ref = $map->nodes()

Returns a reference to the list of nodes.

item

    my $node_2 = $map->item(2);

Returns the node at the index given by the argument (starting from 0).

getNamedItem

    my $node = $map->getNamedItem('phone_number');

Returns the node with the given name.

setNamedItem

    $map->setNamedItem($new_node)

Sets the node with the same name as $new_node to $new_node.

removeNamedItem

    $map->removeNamedItem($name)

Removes the item with the name $name.

getNamedItemNS

Not implemented yet.

setNamedItemNS

Not implemented yet.

removeNamedItemNS

Not implemented yet.

Structured Errors

XML::LibXML::Error

Synopsis

    eval { ...
    };
    if (ref($@)) {
        # handle a structured error (XML::LibXML::Error object)
    } elsif ($@) {
        # error, but not an XML::LibXML::Error object
    } else {
        # no error
    }

Description

The XML::LibXML::Error class is a tiny frontend to libxml2's structured error support. If XML::LibXML is compiled with structured error support, all errors reported by libxml2 are transformed to XML::LibXML::Error objects. These objects automatically serialize to the corresponding error messages when printed or used in a string operation, but as objects, they can also be used to get detailed and structured information about the error that occurred.

Unlike most other XML::LibXML objects, XML::LibXML::Error doesn't wrap an underlying libxml2 structure directly, but rather transforms it to a blessed Perl hash reference containing the individual fields of the structured error information as hash key-value pairs. Individual items (fields) of a structured error can either be obtained directly as $@->{field}, or using autoloaded methods such as $@->field() (where field is the field name). XML::LibXML::Error objects have the following fields: domain, code, level, file, line, nodename, message, str1, str2, str3, num1, num2, and _prev (some of them may be undefined).

$XML::LibXML::Error::WARNINGS

    $XML::LibXML::Error::WARNINGS=1;

Traditionally, XML::LibXML suppressed parser warnings by setting libxml2's global variable xmlGetWarningsDefaultValue to 0. Since 1.70 we do not change libxml2's global variables anymore; for backward compatibility, XML::LibXML still suppresses warnings. This variable can be set to 1 to enable reporting of these warnings via Perl warn and to 2 to report them via die.

as_string

    $message = $@->as_string();

This function serializes an XML::LibXML::Error object to a string containing the full error message, close to the message produced by libxml2 default error handlers and tools like xmllint.
This method is also used to overload the "" operator on XML::LibXML::Error, so it is automatically called whenever an XML::LibXML::Error object is treated as a string (e.g. in print $@). dump print $@->dump(); This function serializes an XML::LibXML::Error to a string displaying all fields of the error structure individually on separate lines of the form 'name' => 'value'. domain $error_domain = $@->domain(); Returns a string containing information about what part of the library raised the error. Can be one of: "parser", "tree", "namespace", "validity", "HTML parser", "memory", "output", "I/O", "ftp", "http", "XInclude", "XPath", "xpointer", "regexp", "Schemas datatype", "Schemas parser", "Schemas validity", "Relax-NG parser", "Relax-NG validity", "Catalog", "C14N", "XSLT", "validity". code $error_code = $@->code(); Returns the actual libxml2 error code. The XML::LibXML::ErrNo module defines constants for individual error codes. Currently libxml2 uses over 480 different error codes. message $error_message = $@->message(); Returns a human-readable informative error message. level $error_level = $@->level(); Returns an integer value describing how severe the error is. XML::LibXML::Error defines the following constants: XML_ERR_NONE = 0 XML_ERR_WARNING = 1 : A simple warning. XML_ERR_ERROR = 2 : A recoverable error. XML_ERR_FATAL = 3 : A fatal error. file $filename = $@->file(); Returns the filename of the file being processed when the error occurred. line $line = $@->line(); The line number, if available. nodename $nodename = $@->nodename(); Name of the node where the error occurred, if available. When this field is non-empty, libxml2 actually returned a physical pointer to the specified node. Due to memory management issues, it is very difficult to implement a way to expose the pointer to the Perl level as an XML::LibXML::Node. For this reason, XML::LibXML::Error currently only exposes the name of the node. str1 $error_str1 = $@->str1(); Error specific.
Extra string information. str2 $error_str2 = $@->str2(); Error specific. Extra string information. str3 $error_str3 = $@->str3(); Error specific. Extra string information. num1 $error_num1 = $@->num1(); Error specific. Extra numeric information. num2 $error_num2 = $@->num2(); In recent libxml2 versions, this value contains the column number of the error or 0 if N/A. context $string = $@->context(); For parsing errors, this field contains about 80 characters of the XML near the place where the error occurred. The field $@->column() contains the corresponding offset. Where N/A, the field is undefined. column $offset = $@->column(); See $@->context() above. _prev $previous_error = $@->_prev(); This field can possibly hold a reference to another XML::LibXML::Error object representing an error which occurred just before this error. Structured Errors XML::LibXML::ErrNo Description This module is based on the xmlerror.h libxml2 C header file. It defines symbolic constants for all libxml2 error codes. Currently libxml2 uses over 480 different error codes. See also XML::LibXML::Error. Constants and Character Encoding Routines XML::LibXML::Common Synopsis use XML::LibXML::Common; Description XML::LibXML::Common defines constants for all node types and provides an interface to libxml2 charset conversion functions. Since XML::LibXML uses its own node type definitions, one may want to use XML::LibXML::Common in its compatibility mode: Exporter TAGS use XML::LibXML::Common qw(:libxml); The :libxml tag will use the XML::LibXML compatibility mode, which defines the old 'XML_' node-type definitions. use XML::LibXML::Common qw(:gdome); The :gdome tag will use the XML::GDOME compatibility mode, which defines the old 'GDOME_' node-type definitions. use XML::LibXML::Common qw(:w3c); This uses the node-type definition names as specified for DOM. use XML::LibXML::Common qw(:encoding); This tag can be used to export only the charset encoding functions of XML::LibXML::Common.
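As a small sketch of how the :libxml tag might be used, the script below classifies the children of an element by comparing nodeType against the 'XML_' constants. The sample document and the @kinds array are invented for the illustration; XML::LibXML is loaded with an empty import list here only so that the constants visibly come from XML::LibXML::Common.

```perl
use strict;
use warnings;
use XML::LibXML ();                    # empty import list: no constants from here
use XML::LibXML::Common qw(:libxml);   # XML_ELEMENT_NODE, XML_TEXT_NODE, ...

# Illustrative document: one text node followed by one element.
my $doc  = XML::LibXML->load_xml(string => '<root>text<child/></root>');
my $root = $doc->documentElement;

my @kinds;
for my $node ($root->childNodes) {
    if    ($node->nodeType == XML_ELEMENT_NODE) { push @kinds, 'element' }
    elsif ($node->nodeType == XML_TEXT_NODE)    { push @kinds, 'text' }
    else                                        { push @kinds, 'other' }
}

print join(',', @kinds), "\n";   # prints "text,element"
```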
Exports By default the W3 definitions as defined in the DOM specifications and the encoding functions are exported by XML::LibXML::Common. Encoding functions To encode or decode a string to or from UTF-8, XML::LibXML::Common exports two functions, which provide an interface to the encoding support in libxml2. Which encodings are supported by these functions depends on how libxml2 was compiled. UTF-16 is always supported and on most installations, ISO encodings are supported as well. This interface was useful for older versions of Perl. Since Perl 5.8 and later provide similar functions via the Encode module, it is probably a good idea to use those instead. encodeToUTF8 $encodedstring = encodeToUTF8( $name_of_encoding, $string_to_encode ); This function converts a byte string from the specified encoding to a UTF-8 encoded character string. decodeFromUTF8 $decodedstring = decodeFromUTF8( $name_of_encoding, $string_to_decode ); This function converts a UTF-8 encoded character string to the specified encoding. Note that the conversion can raise an error if the given string contains characters that cannot be represented in the target encoding. Both functions report their errors on standard error, and if an error occurs the function will croak(). To catch the error and keep an encoding failure from stopping the entire script, call the encoding function from within an eval block. A note on history Before XML::LibXML 1.70, this class was available as a separate CPAN distribution, intended to provide functionality shared between XML::LibXML, XML::GDOME, and possibly other modules. Since there seems to be no progress in this direction, we decided to merge XML::LibXML::Common 0.13 and XML::LibXML 1.70 to one CPAN distribution.
The merge also naturally eliminates a practical and urgent problem experienced by many XML::LibXML users on certain platforms, namely mysterious misbehavior of XML::LibXML occurring if the installed (often pre-packaged) version of XML::LibXML::Common was compiled against an older version of libxml2 than XML::LibXML.
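Returning to the encoding functions described earlier in this section, the following sketch shows one way to call them inside eval blocks. It assumes the libxml2 in use supports ISO-8859-1; the sample string and variable names are illustrative only.

```perl
use strict;
use warnings;
use XML::LibXML::Common qw(:encoding);

# "cafe" with an e-acute, as ISO-8859-1 bytes (illustrative sample data).
my $latin1 = "caf\xE9";

# Wrap each call in eval so a croak() does not stop the script.
my $utf8 = eval { encodeToUTF8('iso-8859-1', $latin1) };
die "encodeToUTF8 failed: $@" if $@;

my $back = eval { decodeFromUTF8('iso-8859-1', $utf8) };
die "decodeFromUTF8 failed: $@" if $@;

print $back eq $latin1
    ? "round trip ok\n"
    : "round trip changed the data\n";
```

For new code, the Encode module's encode() and decode() functions cover the same ground, as the section itself suggests.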