commit 661b9f3e477f48c116a2b5c350cf71ffe6f9e719 Author: su-fang Date: Wed Sep 14 14:44:12 2022 +0800 Import Upstream version 1.02 diff --git a/Changes b/Changes new file mode 100644 index 0000000..09cf977 --- /dev/null +++ b/Changes @@ -0,0 +1,108 @@ +Revision history for Perl extension XML::SAX. + +1.02 14 Jun 2019 Grant McLean + - Spelling fixes (patch from Ville Skyttä) + - Add repo location to metadata (patches from Ville Skyttä & Martin McGrath) + - Reorganise module files under lib/XML + - Regenerate MANIFEST using 'make manifest' to include missing test files + +1.00 15 Feb 2018 Grant McLean + - Add makefile dependency to fix order of build steps RT#62289 (patch from + Ed J) + +0.99 05 Sep 2011 Grant McLean + - Re-release without nested XML-SAX-Base distribution, add prereq instead + (note no other changes from 0.96 release) + +0.96 06 Aug 2008 Grant McLean + - Fix breakage of Unicode regexes on 5.6 (introduced in 0.95 release) + +0.95 05 Aug 2008 Grant McLean + - XML::SAX::PurePerl fixes: + - RT#37147: Fix handling of numeric character entities in attribute + values (report from Jools Smyth) + - RT#19442: Fix for numeric character entities spanning end of buffer + (report from Eivind Eklund) + - RT#29316: Performance fix for parsing from large strings (patch from + Gordon Lack) + - RT#26588: Fix for UTF8 bytes in first 4096 bytes of document not being + decoded to Perl-UTF8-characters (patch from Niko Tyni of the + Debian project) + - RT#37545: incorrect operator precedence breaks single quotes around + DTD entity declarations (report from Kevin Ryde) + - RT#28477: Fix test in ParserFactory.pm for parser module loaded (report + from Douglas Wilson) + - RT#28564: Fix XML::SAX::PurePerl versioning (report from Chapman Flack) + +0.16 27 Jun 2007 Grant McLean + - Applied patch for PI handling from RT#19173 + +0.15 08 Feb 2007 Grant McLean + - Fixed handling of entities in attribute values + - Cleaned up some benign warnings + +0.14 23 Apr 2006 Matt Sergeant + - 
Fixed CDATA section parsing (Uwe Voelker) + - Fix Makefile.PL for VMS + - Support calling set_handler() mid-parse + - Fix for when random modules overload UNIVERSAL::AUTOLOAD() + - Fix case when ParserDetails.ini isn't being updated but we are doing an + upgrade. + +0.13 24 Oct 2005 Matt Sergeant + - Complete re-write of XML::SAX::PurePerl for performance + - Support Encoding & XMLVersion in DocumentLocator interface + - A few conformance tweaks to match perl SAX 2.1. + +0.12 20 Nov 2002 Matt Sergeant + - Made sure SAX.ini works as documented + - Fixed when you specify Module (version) + +0.11 03 Sep 2002 Matt Sergeant + - Base: Merged in XML::SAX::Base 1.04 (including memory leak fixes) + - ParserFactory: Made it do what the docs say when you specify + a module version number. + - SAX: Fixed XML::SAX::Intro typo. + - ParserFactory: Fixed (and test) broken version in parser pkg + +0.10 14 Feb 2002 Matt Sergeant + - PurePerl: Fixed PubidChar missing '-' + - ParserFactory: Allow version in parser package + +0.09 06 Feb 2002 Matt Sergeant + - PurePerl: Performance enhancements + - PurePerl: Line End handling implemented. + - PurePerl: Attribute Value Normalization implemented (mostly). + - PurePerl: Fixed element copying for end_element. + +0.08 30 Jan 2002 Matt Sergeant + - Fixes for perl 5.7.2 + +0.07 29 Jan 2002 Matt Sergeant + - Fixed higher utf8 chars to work on both 5.005 and 5.6.1 + (not tested on 5.6.0 since it's a bad perl) + +0.06 28 Jan 2002 Matt Sergeant + - Fixed empty ParserDetails.ini bug + - Fixed overwriting ParserDetails.ini bug + - Added ability to specify dir to ParserDetails.ini + +0.05 21 Jan 2002 Matt Sergeant + - Updated PREREQ_PM in Makefile.PL + - Updated Docs + +0.04 21 Jan 2002 Matt Sergeant + - More PurePerl updates and fixes + - Added DocumentLocator class + - Added remove_parser() to XML::SAX + +0.03 25 Nov 2001 Matt Sergeant + - Changes lost. 
Mostly PurePerl parser updates + +0.02 14 Nov 2001 Matt Sergeant + - Removed windows line endings so perldocs look ok. + - Fixed some docs + +0.01 Mon Nov 5 22:24:06 2001 Matt Sergeant + - original version; created by elves and penguins v1.32 + diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..2a456d4 --- /dev/null +++ b/LICENSE @@ -0,0 +1,7 @@ +XML::SAX is dual licensed under the same terms as Perl itself. + +This means at your choice, either the Perl Artistic License, or +the GNU GPL version 1 or higher. + +(should we include the full text here?) + diff --git a/MANIFEST b/MANIFEST new file mode 100644 index 0000000..0bb8da3 --- /dev/null +++ b/MANIFEST @@ -0,0 +1,62 @@ +Changes +lib/XML/SAX.pm +lib/XML/SAX/DocumentLocator.pm +lib/XML/SAX/Intro.pod +lib/XML/SAX/ParserFactory.pm +lib/XML/SAX/PurePerl.pm +lib/XML/SAX/PurePerl/DebugHandler.pm +lib/XML/SAX/PurePerl/DocType.pm +lib/XML/SAX/PurePerl/DTDDecls.pm +lib/XML/SAX/PurePerl/EncodingDetect.pm +lib/XML/SAX/PurePerl/Exception.pm +lib/XML/SAX/PurePerl/NoUnicodeExt.pm +lib/XML/SAX/PurePerl/Productions.pm +lib/XML/SAX/PurePerl/Reader.pm +lib/XML/SAX/PurePerl/Reader/NoUnicodeExt.pm +lib/XML/SAX/PurePerl/Reader/Stream.pm +lib/XML/SAX/PurePerl/Reader/String.pm +lib/XML/SAX/PurePerl/Reader/UnicodeExt.pm +lib/XML/SAX/PurePerl/Reader/URI.pm +lib/XML/SAX/PurePerl/UnicodeExt.pm +lib/XML/SAX/PurePerl/XMLDecl.pm +LICENSE +Makefile.PL +MANIFEST +META.yml Module meta-data (added by MakeMaker) +README +t/00basic.t +t/01known.t +t/10xmldecl1.t +t/11xmldecl2.t +t/12miscstart.t +t/13int_ent.t +t/14encoding.t +t/15element.t +t/16large.t +t/19pi.t +t/20factory.t +t/21saxini.t +t/30parse_file.t +t/40cdata.t +t/42entities.t +t/99cleanup.t +testfiles/01.xml +testfiles/02a.xml +testfiles/02b.xml +testfiles/03a.xml +testfiles/03b.xml +testfiles/04a.xml +testfiles/06a.xml +testfiles/06b.xml +testfiles/06c.xml +testfiles/06d.xml +testfiles/06e.xml +testfiles/06f.xml +testfiles/06g.xml +testfiles/iso8859_1.xml 
+testfiles/iso8859_2.xml +testfiles/koi8_r.xml +testfiles/utf-16.xml +testfiles/utf-16le.xml +testfiles/xmltest.xml +META.json Module JSON meta-data (added by MakeMaker) diff --git a/META.json b/META.json new file mode 100644 index 0000000..6140370 --- /dev/null +++ b/META.json @@ -0,0 +1,50 @@ +{ + "abstract" : "unknown", + "author" : [ + "unknown" + ], + "dynamic_config" : 1, + "generated_by" : "ExtUtils::MakeMaker version 7.0401, CPAN::Meta::Converter version 2.150005", + "license" : [ + "unknown" + ], + "meta-spec" : { + "url" : "http://search.cpan.org/perldoc?CPAN::Meta::Spec", + "version" : "2" + }, + "name" : "XML-SAX", + "no_index" : { + "directory" : [ + "t", + "inc" + ] + }, + "prereqs" : { + "build" : { + "requires" : { + "ExtUtils::MakeMaker" : "0" + } + }, + "configure" : { + "requires" : { + "ExtUtils::MakeMaker" : "0" + } + }, + "runtime" : { + "requires" : { + "File::Temp" : "0", + "XML::NamespaceSupport" : "0.03", + "XML::SAX::Base" : "1.05" + } + } + }, + "release_status" : "stable", + "resources" : { + "repository" : { + "type" : "git", + "web" : "https://github.com/grantm/xml-sax" + } + }, + "version" : "1.02", + "x_serialization_backend" : "JSON::PP version 2.27300" +} diff --git a/META.yml b/META.yml new file mode 100644 index 0000000..9eaf40f --- /dev/null +++ b/META.yml @@ -0,0 +1,27 @@ +--- +abstract: unknown +author: + - unknown +build_requires: + ExtUtils::MakeMaker: '0' +configure_requires: + ExtUtils::MakeMaker: '0' +dynamic_config: 1 +generated_by: 'ExtUtils::MakeMaker version 7.0401, CPAN::Meta::Converter version 2.150005' +license: unknown +meta-spec: + url: http://module-build.sourceforge.net/META-spec-v1.4.html + version: '1.4' +name: XML-SAX +no_index: + directory: + - t + - inc +requires: + File::Temp: '0' + XML::NamespaceSupport: '0.03' + XML::SAX::Base: '1.05' +resources: + repository: https://github.com/grantm/xml-sax +version: '1.02' +x_serialization_backend: 'CPAN::Meta::YAML version 0.018' diff --git a/Makefile.PL 
b/Makefile.PL new file mode 100644 index 0000000..e3ff75d --- /dev/null +++ b/Makefile.PL @@ -0,0 +1,64 @@ +use ExtUtils::MakeMaker; +use File::Basename (); +use File::Spec (); + + +WriteMakefile( + 'NAME' => 'XML::SAX', + 'VERSION_FROM' => 'lib/XML/SAX.pm', # finds $VERSION + 'PREREQ_PM' => { + 'File::Temp' => 0, + 'XML::SAX::Base' => 1.05, + 'XML::NamespaceSupport' => 0.03, + }, + META_MERGE => { + "meta-spec" => { version => 2 }, + resources => { + repository => { + type => 'git', + url => 'git@github.com:grantm/XML-SAX.git', + web => 'https://github.com/grantm/xml-sax', + }, + }, + }, +); + +sub MY::install { + package MY; + my $script = shift->SUPER::install(@_); + + # Only modify existing ParserDetails.ini if user agrees + + my $write_ini_ok = 0; + + eval { require XML::SAX }; + if ($@) { + $write_ini_ok = 1; + } + else { + my $dir = File::Basename::dirname($INC{'XML/SAX.pm'}); + if (-e File::Spec->catfile($dir, 'SAX', 'ParserDetails.ini')) { + $write_ini_ok = + ExtUtils::MakeMaker::prompt( + "Do you want XML::SAX to alter ParserDetails.ini?", "Y" + ) =~ /^y/i; + } + else { + $write_ini_ok = 1; + } + } + + if ($write_ini_ok) { + $script =~ s/install :: (.*)$/install :: $1 install_sax_pureperl/m; + $script .= <<"INSTALL"; + +install_sax_pureperl : pure_install +\t\@\$(PERL) -MXML::SAX -e "XML::SAX->add_parser(q(XML::SAX::PurePerl))->save_parsers()" + +INSTALL + + } + + return $script; +} + diff --git a/README b/README new file mode 100644 index 0000000..aea07bb --- /dev/null +++ b/README @@ -0,0 +1,47 @@ +XML::SAX +======== + +XML::SAX consists of several framework classes for using and building +Perl SAX2 XML parsers, filters, and drivers. It is designed around the +need to be able to "plug in" different SAX parsers to an application +without requiring programmer intervention. Those of you familiar with +the DBI will be right at home. Some of the designs come from the Java +JAXP specification (SAX part), only without the javaness. 
+ +INSTALLATION +============ + +All of this is 100% perl, so installation should be a breeze: + +On Unix and Unix-like platforms: + + $ perl Makefile.PL + $ make + $ make test + $ make install + +On ActivePerl, or Perl's built with MSVC: + + $ perl Makefile.PL + $ nmake + $ nmake test + $ nmake install + +USAGE +===== + +For usage, please read the included documentation after installation: + + $ perldoc XML::SAX + $ perldoc XML::SAX::ParserFactory + +And other documents leading off those pages. + +LICENSE/COPYRIGHT +================= + +Please see the LICENSE file for licensing issues. + +The files in this distribution are copyrighted their respective authors +as detailed in the POD documentation of each module. + diff --git a/lib/XML/SAX.pm b/lib/XML/SAX.pm new file mode 100644 index 0000000..7cd93d2 --- /dev/null +++ b/lib/XML/SAX.pm @@ -0,0 +1,379 @@ +# $Id$ + +package XML::SAX; + +use strict; +use vars qw($VERSION @ISA @EXPORT_OK); + +$VERSION = '1.02'; + +use Exporter (); +@ISA = ('Exporter'); + +@EXPORT_OK = qw(Namespaces Validation); + +use File::Basename qw(dirname); +use File::Spec (); +use Symbol qw(gensym); +use XML::SAX::ParserFactory (); # loaded for simplicity + +use constant PARSER_DETAILS => "ParserDetails.ini"; + +use constant Namespaces => "http://xml.org/sax/features/namespaces"; +use constant Validation => "http://xml.org/sax/features/validation"; + +my $known_parsers = undef; + +# load_parsers takes the ParserDetails.ini file out of the same directory +# that XML::SAX is in, and looks at it. 
Format in POD below + +=begin EXAMPLE + +[XML::SAX::PurePerl] +http://xml.org/sax/features/namespaces = 1 +http://xml.org/sax/features/validation = 0 +# a comment + +# blank lines ignored + +[XML::SAX::AnotherParser] +http://xml.org/sax/features/namespaces = 0 +http://xml.org/sax/features/validation = 1 + +=end EXAMPLE + +=cut + +sub load_parsers { + my $class = shift; + my $dir = shift; + + # reset parsers + $known_parsers = []; + + # get directory from wherever XML::SAX is installed + if (!$dir) { + $dir = $INC{'XML/SAX.pm'}; + $dir = dirname($dir); + } + + my $fh = gensym(); + if (!open($fh, File::Spec->catfile($dir, "SAX", PARSER_DETAILS))) { + XML::SAX->do_warn("could not find " . PARSER_DETAILS . " in $dir/SAX\n"); + return $class; + } + + $known_parsers = $class->_parse_ini_file($fh); + + return $class; +} + +sub _parse_ini_file { + my $class = shift; + my ($fh) = @_; + + my @config; + + my $lineno = 0; + while (defined(my $line = <$fh>)) { + $lineno++; + my $original = $line; + # strip whitespace + $line =~ s/\s*$//m; + $line =~ s/^\s*//m; + # strip comments + $line =~ s/[#;].*$//m; + # ignore blanks + next if $line =~ /^$/m; + + # heading + if ($line =~ /^\[\s*(.*)\s*\]$/m) { + push @config, { Name => $1 }; + next; + } + + # instruction + elsif ($line =~ /^(.*?)\s*?=\s*(.*)$/) { + unless(@config) { + push @config, { Name => '' }; + } + $config[-1]{Features}{$1} = $2; + } + + # not whitespace, comment, or instruction + else { + die "Invalid line in ini: $lineno\n>>> $original\n"; + } + } + + return \@config; +} + +sub parsers { + my $class = shift; + if (!$known_parsers) { + $class->load_parsers(); + } + return $known_parsers; +} + +sub remove_parser { + my $class = shift; + my ($parser_module) = @_; + + if (!$known_parsers) { + $class->load_parsers(); + } + + @$known_parsers = grep { $_->{Name} ne $parser_module } @$known_parsers; + + return $class; +} + +sub add_parser { + my $class = shift; + my ($parser_module) = @_; + + if (!$known_parsers) { + 
$class->load_parsers(); + } + + # first load module, then query features, then push onto known_parsers, + + my $parser_file = $parser_module; + $parser_file =~ s/::/\//g; + $parser_file .= ".pm"; + + require $parser_file; + + my @features = $parser_module->supported_features(); + + my $new = { Name => $parser_module }; + foreach my $feature (@features) { + $new->{Features}{$feature} = 1; + } + + # If exists in list already, move to end. + my $done = 0; + my $pos = undef; + for (my $i = 0; $i < @$known_parsers; $i++) { + my $p = $known_parsers->[$i]; + if ($p->{Name} eq $parser_module) { + $pos = $i; + } + } + if (defined $pos) { + splice(@$known_parsers, $pos, 1); + push @$known_parsers, $new; + $done++; + } + + # Otherwise (not in list), add at end of list. + if (!$done) { + push @$known_parsers, $new; + } + + return $class; +} + +sub save_parsers { + my $class = shift; + + # get directory from wherever XML::SAX is installed + my $dir = $INC{'XML/SAX.pm'}; + $dir = dirname($dir); + + my $file = File::Spec->catfile($dir, "SAX", PARSER_DETAILS); + chmod 0644, $file; + unlink($file); + + my $fh = gensym(); + open($fh, ">$file") || + die "Cannot write to $file: $!"; + + foreach my $p (@$known_parsers) { + print $fh "[$p->{Name}]\n"; + foreach my $key (keys %{$p->{Features}}) { + print $fh "$key = $p->{Features}{$key}\n"; + } + print $fh "\n"; + } + + print $fh "\n"; + + close $fh; + + return $class; +} + +sub do_warn { + my $class = shift; + # Don't output warnings if running under Test::Harness + warn(@_) unless $ENV{HARNESS_ACTIVE}; +} + +1; +__END__ + +=head1 NAME + +XML::SAX - Simple API for XML + +=head1 SYNOPSIS + + use XML::SAX; + + # get a list of known parsers + my $parsers = XML::SAX->parsers(); + + # add/update a parser + XML::SAX->add_parser(q(XML::SAX::PurePerl)); + + # remove parser + XML::SAX->remove_parser(q(XML::SAX::Foodelberry)); + + # save parsers + XML::SAX->save_parsers(); + +=head1 DESCRIPTION + +XML::SAX is a SAX parser access API for Perl. 
It includes classes +and APIs required for implementing SAX drivers, along with a factory +class for returning any SAX parser installed on the user's system. + +=head1 USING A SAX2 PARSER + +The factory class is XML::SAX::ParserFactory. Please see the +documentation of that module for how to instantiate a SAX parser: +L<XML::SAX::ParserFactory>. However if you don't want to load up +another manual page, here's a short synopsis: + + use XML::SAX::ParserFactory; + use XML::SAX::XYZHandler; + my $handler = XML::SAX::XYZHandler->new(); + my $p = XML::SAX::ParserFactory->parser(Handler => $handler); + $p->parse_uri("foo.xml"); + # or $p->parse_string("<foo/>") or $p->parse_file($fh); + +This will automatically load a SAX2 parser (defaulting to +XML::SAX::PurePerl if no others are found) and return it to you. + +In order to learn how to use SAX to parse XML, you will need to read +L<XML::SAX::Intro> and for reference, L<XML::SAX::Specification>. + +=head1 WRITING A SAX2 PARSER + +The first thing to remember in writing a SAX2 parser is to subclass +XML::SAX::Base. This will make your life infinitely easier, by providing +a number of methods automagically for you. See L<XML::SAX::Base> for more +details. + +When writing a SAX2 parser that is compatible with XML::SAX, you need +to inform XML::SAX of the presence of that driver when you install it. +In order to do that, XML::SAX contains methods for saving the fact that +the parser exists on your system to an "INI" file, which is then loaded +to determine which parsers are installed. + +The best way to do this is to follow these rules: + +=over 4 + +=item * Add XML::SAX as a prerequisite in Makefile.PL: + + WriteMakefile( + ... + PREREQ_PM => { 'XML::SAX' => 0 }, + ... + ); + +Alternatively you may wish to check for it in other ways that will +cause more than just a warning. 
+ +=item * Add the following code snippet to your Makefile.PL: + + sub MY::install { + package MY; + my $script = shift->SUPER::install(@_); + if (ExtUtils::MakeMaker::prompt( + "Do you want to modify ParserDetails.ini?", 'Y') + =~ /^y/i) { + $script =~ s/install :: (.*)$/install :: $1 install_sax_driver/m; + $script .= <<"INSTALL"; + + install_sax_driver : + \t\@\$(PERL) -MXML::SAX -e "XML::SAX->add_parser(q(\$(NAME)))->save_parsers()" + + INSTALL + } + return $script; + } + +Note that you should check the output of this - \$(NAME) will use the name of +your distribution, which may not be exactly what you want. For example XML::LibXML +has a driver called XML::LibXML::SAX::Generator, which is used in place of +\$(NAME) in the above. + +=item * Add an XML::SAX test: + +A test file should be added to your t/ directory containing something like the +following: + + use Test; + BEGIN { plan tests => 4 } + use XML::SAX; + use XML::SAX::PurePerl::DebugHandler; + XML::SAX->add_parser(q(XML::SAX::MyDriver)); + local $XML::SAX::ParserPackage = 'XML::SAX::MyDriver'; + eval { + my $handler = XML::SAX::PurePerl::DebugHandler->new(); + ok($handler); + my $parser = XML::SAX::ParserFactory->parser(Handler => $handler); + ok($parser); + ok($parser->isa('XML::SAX::MyDriver')); + $parser->parse_string("<tag/>"); + ok($handler->{seen}{start_element}); + }; + +=back + +=head1 EXPORTS + +By default, XML::SAX exports nothing into the caller's namespace. 
However you +can request the symbols C<Namespaces> and C<Validation> which are the +URIs for those features, allowing an easier way to request those features +via ParserFactory: + + use XML::SAX qw(Namespaces Validation); + my $factory = XML::SAX::ParserFactory->new(); + $factory->require_feature(Namespaces); + $factory->require_feature(Validation); + my $parser = $factory->parser(); + +=head1 AUTHOR + +Current maintainer: Grant McLean, grantm@cpan.org + +Originally written by: + +Matt Sergeant, matt@sergeant.org + +Kip Hampton, khampton@totalcinema.com + +Robin Berjon, robin@knowscape.com + +=head1 LICENSE + +This is free software, you may use it and distribute it under +the same terms as Perl itself. + +=head1 SEE ALSO + +L<XML::SAX::Base> for writing SAX Filters and Parsers + +L<XML::SAX::PurePerl> for an XML parser written in 100% +pure perl. + +L<XML::SAX::Exception> for details on exception handling + +=cut + diff --git a/lib/XML/SAX/DocumentLocator.pm b/lib/XML/SAX/DocumentLocator.pm new file mode 100644 index 0000000..dba56d5 --- /dev/null +++ b/lib/XML/SAX/DocumentLocator.pm @@ -0,0 +1,134 @@ +# $Id$ + +package XML::SAX::DocumentLocator; +use strict; + +sub new { + my $class = shift; + my %object; + tie %object, $class, @_; + + return bless \%object, $class; +} + +sub TIEHASH { + my $class = shift; + my ($pubmeth, $sysmeth, $linemeth, $colmeth, $encmeth, $xmlvmeth) = @_; + return bless { + pubmeth => $pubmeth, + sysmeth => $sysmeth, + linemeth => $linemeth, + colmeth => $colmeth, + encmeth => $encmeth, + xmlvmeth => $xmlvmeth, + }, $class; +} + +sub FETCH { + my ($self, $key) = @_; + my $method; + if ($key eq 'PublicId') { + $method = $self->{pubmeth}; + } + elsif ($key eq 'SystemId') { + $method = $self->{sysmeth}; + } + elsif ($key eq 'LineNumber') { + $method = $self->{linemeth}; + } + elsif ($key eq 'ColumnNumber') { + $method = $self->{colmeth}; + } + elsif ($key eq 'Encoding') { + $method = $self->{encmeth}; + } + elsif ($key eq 'XMLVersion') { + $method = $self->{xmlvmeth}; + } + if ($method) { + my $value = $method->($key); + 
return $value; + } + return undef; +} + +sub EXISTS { + my ($self, $key) = @_; + if ($key =~ /^(PublicId|SystemId|LineNumber|ColumnNumber|Encoding|XMLVersion)$/) { + return 1; + } + return 0; +} + +sub STORE { + my ($self, $key, $value) = @_; +} + +sub DELETE { + my ($self, $key) = @_; +} + +sub CLEAR { + my ($self) = @_; +} + +sub FIRSTKEY { + my ($self) = @_; + # assignment resets. + $self->{keys} = { + PublicId => 1, + SystemId => 1, + LineNumber => 1, + ColumnNumber => 1, + Encoding => 1, + XMLVersion => 1, + }; + return each %{$self->{keys}}; +} + +sub NEXTKEY { + my ($self, $lastkey) = @_; + return each %{$self->{keys}}; +} + +1; +__END__ + +=head1 NAME + +XML::SAX::DocumentLocator - Helper class for document locators + +=head1 SYNOPSIS + + my $locator = XML::SAX::DocumentLocator->new( + sub { $object->get_public_id }, + sub { $object->get_system_id }, + sub { $reader->current_line }, + sub { $reader->current_column }, + sub { $reader->get_encoding }, + sub { $reader->get_xml_version }, + ); + +=head1 DESCRIPTION + +This module gives you a tied hash reference that calls the +specified closures when asked for PublicId, SystemId, +LineNumber, ColumnNumber, Encoding and XMLVersion. + +It is useful for writing SAX Parsers so that you don't have +to constantly update the line numbers in a hash reference on +the object you pass to set_document_locator(). See the source +code for XML::SAX::PurePerl for a usage example. + +=head1 API + +There is only 1 method: C<new>. Simply pass it a list of +closures that when called will return the PublicId, the +SystemId, the LineNumber, the ColumnNumber, the Encoding +and the XMLVersion respectively. + +The closures are passed a single parameter, the key being +requested. But you're free to ignore that. 
+ +=cut + diff --git a/lib/XML/SAX/Intro.pod b/lib/XML/SAX/Intro.pod new file mode 100644 index 0000000..33c77d4 --- /dev/null +++ b/lib/XML/SAX/Intro.pod @@ -0,0 +1,407 @@ +=head1 NAME + +XML::SAX::Intro - An Introduction to SAX Parsing with Perl + +=head1 Introduction + +XML::SAX is a new way to work with XML Parsers in Perl. In this article +we'll discuss why you should be using SAX, why you should be using +XML::SAX, and we'll see some of the finer implementation details. The +text below assumes some familiarity with callback, or push based +parsing, but if you are unfamiliar with these techniques then a good +place to start is Kip Hampton's excellent series of articles on XML.com. + +=head1 Replacing XML::Parser + +The de-facto way of parsing XML under perl is to use Larry Wall and +Clark Cooper's XML::Parser. This module is a Perl and XS wrapper around +the expat XML parser library by James Clark. It has been a hugely +successful project, but suffers from a couple of rather major flaws. +Firstly it is a proprietary API, designed before the SAX API was +conceived, which means that it is not easily replaceable by other +streaming parsers. Secondly its callbacks are subrefs. This doesn't +sound like much of an issue, but unfortunately leads to code like: + + sub handle_start { + my ($e, $el, %attrs) = @_; + if ($el eq 'foo') { + $e->{inside_foo}++; # BAD! $e is an XML::Parser::Expat object. + } + } + +As you can see, we're using the $e object to hold our state +information, which is a bad idea because we don't own that object - we +didn't create it. It's an internal object of XML::Parser, that happens +to be a hashref. We could all too easily overwrite XML::Parser internal +state variables by using this, or Clark could change it to an array ref +(not that he would, because it would break so much code, but he could). 
+ +The only way currently with XML::Parser to safely maintain state is to +use a closure: + + my $state = MyState->new(); + $parser->setHandlers(Start => sub { handle_start($state, @_) }); + +This closure traps the $state variable, which now gets passed as the +first parameter to your callback. Unfortunately very few people use +this technique, as it is not documented in the XML::Parser POD files. + +Another reason you might not want to use XML::Parser is because you +need some feature that it doesn't provide (such as validation), or you +might need to use a library that doesn't use expat, due to it not being +installed on your system, or due to having a restrictive ISP. Using SAX +allows you to work around these restrictions. + +=head1 Introducing SAX + +SAX stands for the Simple API for XML. And simple it really is. +Constructing a SAX parser and passing events to handlers is done as +simply as: + + use XML::SAX; + use MySAXHandler; + + my $parser = XML::SAX::ParserFactory->parser( + Handler => MySAXHandler->new + ); + + $parser->parse_uri("foo.xml"); + +The important concept to grasp here is that SAX uses a factory class +called XML::SAX::ParserFactory to create a new parser instance. The +reason for this is so that you can support other underlying +parser implementations for different feature sets. This is one thing +that XML::Parser has always sorely lacked. + +In the code above we see the parse_uri method used, but we could +have equally well +called parse_file, parse_string, or parse(). Please see XML::SAX::Base +for what these methods take as parameters, but don't be fooled into +believing parse_file takes a filename. No, it takes a file handle, a +glob, or a subclass of IO::Handle. Beware. + +SAX works very similarly to XML::Parser's default callback method, +except it has one major difference: rather than setting individual +callbacks, you create a new class in which to receive the callbacks. 
+Each callback is called as a method call on an instance of that handler +class. An example will best demonstrate this: + + package MySAXHandler; + use base qw(XML::SAX::Base); + + sub start_document { + my ($self, $doc) = @_; + # process document start event + } + + sub start_element { + my ($self, $el) = @_; + # process element start event + } + +Now, when we instantiate this as above, and parse some XML with this as +the handler, the methods start_document and start_element will be +called as method calls, so this would be the equivalent of directly +calling: + + $object->start_element($el); + +Notice how this is different to XML::Parser's calling style, which +calls: + + start_element($e, $name, %attribs); + +It's this difference between function calling and method calling that +allows you to subclass SAX handlers, which contributes to SAX being a +powerful solution. + +As you can see, unlike XML::Parser, we have to define a new package in +which to do our processing (there are hacks you can do to make this +unnecessary, but I'll leave figuring those out to the experts). The +biggest benefit of this is that you maintain your own state variable +($self in the above example) thus freeing you of the concerns listed +above. It is also an improvement in maintainability - you can place the +code in a separate file if you wish to, and your callback methods are +always called the same thing, rather than having to choose a suitable +name for them as you had to with XML::Parser. This is an obvious win. + +SAX parsers are also very flexible in how you pass a handler to them. +You can use a constructor parameter as we saw above, or we can pass the +handler directly in the call to one of the parse methods: + + $parser->parse(Handler => $handler, + Source => { SystemId => "foo.xml" }); + # or... 
+ $parser->parse_file($fh, Handler => $handler); + +This flexibility allows for one parser to be used in many different +scenarios throughout your script (though one shouldn't feel pressure to +use this method, as parser construction is generally not a time +consuming process). + +=head1 Callback Parameters + +The only other thing you need to know to understand basic SAX is the +structure of the parameters passed to each of the callbacks. In +XML::Parser, all parameters are passed as multiple options to the +callbacks, so for example the Start callback would be called as +my_start($e, $name, %attributes), and the PI callback would be called +as my_processing_instruction($e, $target, $data). In SAX, every +callback is passed a hash reference, containing entries that define our +"node". The key callbacks and the structures they receive are: + +=head2 start_element + +The start_element handler is called whenever a parser sees an opening +tag. It is passed an element structure consisting of: + +=over 4 + +=item LocalName + +The name of the element minus any namespace prefix it may +have come with in the document. + +=item NamespaceURI + +The URI of the namespace associated with this element, +or the empty string for none. + +=item Attributes + +A set of attributes as described below. + +=item Name + +The name of the element as it was seen in the document (i.e. +including any prefix associated with it) + +=item Prefix + +The prefix used to qualify this element's namespace, or the +empty string if none. + +=back + +The B<Attributes> are a hash reference, keyed by what we have called +"James Clark" notation. This means that the attribute name has been +expanded to include any associated namespace URI, and put together as +{ns}name, where "ns" is the expanded namespace URI of the attribute if +and only if the attribute had a prefix, and "name" is the LocalName of +the attribute. 
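As a rough illustration of the notation just described, here is a small sketch that builds the {ns}name key for an attribute. The helper name clark_key is an assumption made up for this example and is not part of XML::SAX, and the namespace URI is only sample data:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::SAX): build the "{ns}name" key
# for an attribute from its namespace URI and its local name.
sub clark_key {
    my ($namespace_uri, $local_name) = @_;
    $namespace_uri = '' unless defined $namespace_uri;
    return '{' . $namespace_uri . '}' . $local_name;
}

# A prefixed attribute such as xlink:href, with the prefix bound to a
# sample namespace URI
print clark_key('http://www.w3.org/1999/xlink', 'href'), "\n";

# An attribute without a prefix has an empty namespace part
print clark_key('', 'id'), "\n";
```

A handler could use such a key to look up one attribute directly in the Attributes hash instead of looping over all of its entries.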
+ +The value of each entry in the attributes hash is another hash +structure consisting of: + +=over 4 + +=item LocalName + +The name of the attribute minus any namespace prefix it may have +come with in the document. + +=item NamespaceURI + +The URI of the namespace associated with this attribute. If the +attribute had no prefix, then this consists of just the empty string. + +=item Name + +The attribute's name as it appeared in the document, including any +namespace prefix. + +=item Prefix + +The prefix used to qualify this attribute's namespace, or the +empty string if none. + +=item Value + +The value of the attribute. + +=back + +So a full example, as output by Data::Dumper might be: + + .... + +=head2 end_element + +The end_element handler is called either when a parser sees a closing +tag, or after start_element has been called for an empty element (do +note, however, that a parser may, if it is so inclined, call characters +with an empty string when it sees an empty element). There is no simple +way in SAX to determine whether the parser in fact saw an empty element +or a start and end element with no content. + +The end_element handler receives exactly the same structure as +start_element, minus the Attributes entry. One must note though that it +should not be a reference to the same data as start_element receives, +so you may change the values in start_element but this will not affect +the values later seen by end_element. + +=head2 characters + +The characters callback may be called in several circumstances. The +most obvious one is when seeing ordinary character data in the markup. +But it is also called for text in a CDATA section, and is also called +in other situations. A SAX parser has to make no guarantees whatsoever +about how many times it may call characters for a stretch of text in an +XML document - it may call once, or it may call once for every +character in the text. 
In order to work around this it is often +important for the SAX developer to use a bundling technique, where text +is gathered up and processed in one of the other callbacks. This is not +always necessary, but it is a worthwhile technique to learn, which we +will cover in XML::SAX::Advanced (when I get around to writing it). + +The characters handler is called with a very simple structure - a hash +reference consisting of just one entry: + +=over 4 + +=item Data + +The text data that was received. + +=back + +=head2 comment + +The comment callback is called for comment text. Unlike with +C<characters>, the comment callback *must* be invoked just once for an +entire comment string. It receives a single simple structure - a hash +reference containing just one entry: + +=over 4 + +=item Data + +The text of the comment. + +=back + +=head2 processing_instruction + +The processing instruction handler is called for all processing +instructions in the document. Note that these processing instructions +may appear before the document root element, or after it, or anywhere +where text and elements would normally appear within the document, +according to the XML specification. + +The handler is passed a structure containing just two entries: + +=over 4 + +=item Target + +The target of the processing instruction + +=item Data + +The text data in the processing instruction. Can be an empty +string for a processing instruction that has no data element. +For example E<lt>?wiggle?E<gt> is a perfectly valid processing instruction. + +=back + +=head1 Tip of the iceberg + +What we have discussed above is really the tip of the SAX iceberg. And +so far it looks like there's not much of interest to SAX beyond what we +have seen with XML::Parser. But it does go much further than that, I +promise. 
+ +People who hate Object Oriented code for the sake of it may be thinking +here that creating a new package just to parse something is a waste +when they've been parsing things just fine up to now using procedural +code. But there's a reason for all this madness. And that reason is SAX +Filters. + +As you saw right at the very start, to let the parser know about our +class, we pass it an instance of our class as the Handler to the +parser. But now imagine what would happen if our class could also take +a Handler option, and simply do some processing and pass on our data +further down the line? That in a nutshell is how SAX filters work. It's +Unix pipes for the 21st century! + +There are two downsides to this. First, writing SAX filters can be +tricky. If you look into the future and read the advanced tutorial I'm +writing, you'll see that Handler can come in several shapes and sizes. +So making sure your filter does the right thing can be tricky. +Second, constructing complex filter chains can be difficult, and +simple thinking tells us that we only get one pass at our document, +when often we'll need more than that. + +Luckily though, those downsides have been fixed by the release of two +very cool modules. What's even better is that I didn't write either of +them! + +The first module is XML::SAX::Base. This is a VITAL SAX module that +acts as a base class for all SAX parsers and filters. It provides an +abstraction away from calling the handler methods, that makes sure your +filter or parser does the right thing, and it does it FAST. So, if you +ever need to write a SAX filter, which if you're processing XML -> XML, +or XML -> HTML, then you probably do, then you need to be writing it as +a subclass of XML::SAX::Base. Really - this is advice not to be ignored +lightly. I will not go into the details of writing a SAX filter here. +Kip Hampton, the author of XML::SAX::Base, has covered this nicely in +his article on XML.com. 
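The filter idea itself (a class that takes a Handler, does some processing, and passes the event on) can be sketched without any SAX modules. The code below is only an illustration under that assumption: UpcaseFilter and Recorder are hypothetical names, and the hand-written forwarding is exactly the part that XML::SAX::Base takes care of in real code, which is why a real filter should subclass it as advised above.

```perl
use strict;
use warnings;

package UpcaseFilter;   # hypothetical filter, for illustration only

sub new {
    my ($class, %args) = @_;
    return bless { Handler => $args{Handler} }, $class;
}

# Do some processing, then pass the event further down the line.
sub start_element {
    my ($self, $el) = @_;
    my %copy = (%$el, Name => uc $el->{Name});   # never modify the caller's hash
    $self->{Handler}->start_element(\%copy);
}

sub end_element {
    my ($self, $el) = @_;
    my %copy = (%$el, Name => uc $el->{Name});
    $self->{Handler}->end_element(\%copy);
}

package Recorder;       # hypothetical end-of-chain handler

sub new {
    my ($class) = @_;
    return bless { events => [] }, $class;
}

sub start_element { push @{ $_[0]{events} }, "start:$_[1]{Name}" }
sub end_element   { push @{ $_[0]{events} }, "end:$_[1]{Name}" }

package main;

# Chains are built back to front: the last handler is created first.
my $recorder = Recorder->new;
my $filter   = UpcaseFilter->new(Handler => $recorder);

# Stand in for a parser by firing the events by hand.
$filter->start_element({ Name => 'foo' });
$filter->end_element({ Name => 'foo' });

print join(' ', @{ $recorder->{events} }), "\n";
```

Because the filter only needs an object with the right method names downstream, filters of this shape can be stacked to any depth.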
+ +To construct SAX pipelines, Barrie Slaymaker, a long time Perl hacker +whose modules you will probably have heard of or used, wrote a very +clever module called XML::SAX::Machines. This combines some really +clever SAX filter-type modules, with a construction toolkit for filters +that makes building pipelines easy. But before we see how it makes +things easy, first let's see how tricky it looks to build complex SAX +filter pipelines. + + use XML::SAX::ParserFactory; + use XML::SAX::Filter1; + use XML::SAX::Filter2; + use XML::SAX::Writer; + + my $output_string; + my $writer = XML::SAX::Writer->new(Output => \$output_string); + my $filter2 = XML::SAX::Filter2->new(Handler => $writer); + my $filter1 = XML::SAX::Filter1->new(Handler => $filter2); + my $parser = XML::SAX::ParserFactory->parser(Handler => $filter1); + + $parser->parse_uri("foo.xml"); + +This is a lot easier with XML::SAX::Machines: + + use XML::SAX::Machines qw(Pipeline); + + my $output_string; + my $parser = Pipeline( + XML::SAX::Filter1 => XML::SAX::Filter2 => \$output_string + ); + + $parser->parse_uri("foo.xml"); + +One of the main benefits of XML::SAX::Machines is that the pipelines +are constructed in natural order, rather than the reverse order we saw +with manual pipeline construction. XML::SAX::Machines takes care of all +the internals of pipe construction, providing you at the end with just +a parser you can use (and you can re-use the same parser as many times +as you need to). + +Just a final tip. If you ever get stuck and are confused about what is +being passed from one SAX filter or parser to the next, then +Devel::TraceSAX will come to your rescue. This perl debugger plugin +will allow you to dump the SAX stream of events as it goes by. Usage is +really very simple: just call your perl script that uses SAX as follows: + + $ perl -d:TraceSAX <scriptname> + +And preferably pipe the output to a pager of some sort, such as more or +less. 
The output is extremely verbose, but should help clear some +issues up. + +=head1 AUTHOR + +Matt Sergeant, matt@sergeant.org + +$Id$ + +=cut diff --git a/lib/XML/SAX/ParserFactory.pm b/lib/XML/SAX/ParserFactory.pm new file mode 100644 index 0000000..dd3ffec --- /dev/null +++ b/lib/XML/SAX/ParserFactory.pm @@ -0,0 +1,230 @@ +# $Id$ + +package XML::SAX::ParserFactory; + +use strict; +use vars qw($VERSION); + +$VERSION = '1.02'; + +use Symbol qw(gensym); +use XML::SAX; +use XML::SAX::Exception; + +sub new { + my $class = shift; + my %params = @_; # TODO : Fix this in spec. + my $self = bless \%params, $class; + $self->{KnownParsers} = XML::SAX->parsers(); + return $self; +} + +sub parser { + my $self = shift; + my @parser_params = @_; + if (!ref($self)) { + $self = $self->new(); + } + + my $parser_class = $self->_parser_class(); + + my $version = ''; + if ($parser_class =~ s/\s*\(([\d\.]+)\)\s*$//) { + $version = " $1"; + } + + if (!$parser_class->can('new')) { + eval "require $parser_class $version;"; + die $@ if $@; + } + + return $parser_class->new(@parser_params); +} + +sub require_feature { + my $self = shift; + my ($feature) = @_; + $self->{RequiredFeatures}{$feature}++; + return $self; +} + +sub _parser_class { + my $self = shift; + + # First try ParserPackage + if ($XML::SAX::ParserPackage) { + return $XML::SAX::ParserPackage; + } + + # Now check if required/preferred is there + if ($self->{RequiredFeatures}) { + my %required = %{$self->{RequiredFeatures}}; + # note - we never go onto the next try (ParserDetails.ini), + # because if we can't provide the requested feature + # we need to throw an exception. + PARSER: + foreach my $parser (reverse @{$self->{KnownParsers}}) { + foreach my $feature (keys %required) { + if (!exists $parser->{Features}{$feature}) { + next PARSER; + } + } + # got here - all features must exist! + return $parser->{Name}; + } + # TODO : should this be NotSupported() ? 
+ throw XML::SAX::Exception ( + Message => "Unable to provide required features", + ); + } + + # Next try SAX.ini + for my $dir (@INC) { + my $fh = gensym(); + if (open($fh, "$dir/SAX.ini")) { + my $param_list = XML::SAX->_parse_ini_file($fh); + my $params = $param_list->[0]->{Features}; + if ($params->{ParserPackage}) { + return $params->{ParserPackage}; + } + else { + # we have required features (or nothing?) + PARSER: + foreach my $parser (reverse @{$self->{KnownParsers}}) { + foreach my $feature (keys %$params) { + if (!exists $parser->{Features}{$feature}) { + next PARSER; + } + } + return $parser->{Name}; + } + XML::SAX->do_warn("Unable to provide SAX.ini required features. Using fallback\n"); + } + last; # stop after first INI found + } + } + + if (@{$self->{KnownParsers}}) { + return $self->{KnownParsers}[-1]{Name}; + } + else { + return "XML::SAX::PurePerl"; # backup plan! + } +} + +1; +__END__ + +=head1 NAME + +XML::SAX::ParserFactory - Obtain a SAX parser + +=head1 SYNOPSIS + + use XML::SAX::ParserFactory; + use XML::SAX::XYZHandler; + my $handler = XML::SAX::XYZHandler->new(); + my $p = XML::SAX::ParserFactory->parser(Handler => $handler); + $p->parse_uri("foo.xml"); + # or $p->parse_string("") or $p->parse_file($fh); + +=head1 DESCRIPTION + +XML::SAX::ParserFactory is a factory class for providing an application +with a Perl SAX2 XML parser. It is akin to DBI - a front end for other +parser classes. Each new SAX2 parser installed will register itself +with XML::SAX, and then it will become available to all applications +that use XML::SAX::ParserFactory to obtain a SAX parser. 
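For the required-features case, the search performed by _parser_class in the code above can be sketched in isolation. This is a simplified illustration, not the module's code: the registry contents, the parser names My::ParserA and My::ParserB, the short feature names, and the pick_parser function are all invented, and only the Name/Features shape and the reverse search order are taken from the module.

```perl
use strict;
use warnings;

# Hypothetical registry in installation order (oldest first), using the
# same Name/Features shape that _parser_class reads.
my @known = (
    { Name => 'My::ParserA', Features => { validation => 1, namespaces => 1 } },
    { Name => 'My::ParserB', Features => { namespaces => 1 } },
);

# Return the most recently installed parser that has every required
# feature, or nothing at all if no parser qualifies.
sub pick_parser {
    my ($known, @required) = @_;
    PARSER:
    foreach my $parser (reverse @$known) {
        foreach my $feature (@required) {
            next PARSER unless exists $parser->{Features}{$feature};
        }
        return $parser->{Name};
    }
    return;   # the real module throws an exception at this point
}

my $default    = pick_parser(\@known);
my $validating = pick_parser(\@known, 'validation');
my $missing    = pick_parser(\@known, 'no-such-feature');

print "default:    $default\n";
print "validating: $validating\n";
print "missing:    ", (defined $missing ? $missing : '(none)'), "\n";
```

With no requirements the last installed parser wins; adding a requirement skips past parsers that lack it, which is why installation order matters.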
+ +Unlike DBI however, XML/SAX parsers almost all work alike (especially +if they subclass XML::SAX::Base, as they should), so rather than +specifying the parser you want in the call to C<parser()>, XML::SAX +has several ways to automatically choose which parser to use: + +=over 4 + +=item * $XML::SAX::ParserPackage + +If this package variable is set, then this package is C<require()>d +and an instance of this package is returned by calling the C<new()> +class method in that package. If it cannot be loaded or there is +an error, an exception will be thrown. The variable can also contain +a version number: + + $XML::SAX::ParserPackage = "XML::SAX::Expat (0.72)"; + +And the number will be treated as a minimum version number. + +=item * Required features + +It is possible to require features from the parsers. For example, you +may wish for a parser that supports validation via a DTD. To do that, +use the following code: + + use XML::SAX::ParserFactory; + my $factory = XML::SAX::ParserFactory->new(); + $factory->require_feature('http://xml.org/sax/features/validation'); + my $parser = $factory->parser(...); + +Alternatively, specify the required features in the call to the +ParserFactory constructor: + + my $factory = XML::SAX::ParserFactory->new( + RequiredFeatures => { + 'http://xml.org/sax/features/validation' => 1, + } + ); + +If the features you have asked for are unavailable (for example the +user might not have a validating parser installed), then an +exception will be thrown. + +The list of known parsers is searched in reverse order, so it will +always return the last installed parser that supports all of your +requested features (Note: this is subject to change if someone +comes up with a better way of making this work). + +=item * SAX.ini + +ParserFactory will search @INC for a file called SAX.ini, which +is in a simple format: + + # a comment looks like this, + ; or like this, and are stripped anywhere in the file + key = value # SAX.ini contains key/value pairs. 
+ +All whitespace is non-significant. + +This file can contain either a line: + + ParserPackage = MyParserModule (1.02) + +Where MyParserModule is the module to load and use for the parser, +and the number in brackets is a minimum version to load. + +Or you can list required features: + + http://xml.org/sax/features/validation = 1 + +And each feature with a true value will be required. + +=item * Fallback + +If none of the above works, the last parser installed on the user's +system will be used. The XML::SAX package ships with a pure perl +XML parser, XML::SAX::PurePerl, so that there will always be a +fallback parser. + +=back + +=head1 AUTHOR + +Matt Sergeant, matt@sergeant.org + +=head1 LICENSE + +This is free software, you may use it and distribute it under the same +terms as Perl itself. + +=cut + diff --git a/lib/XML/SAX/PurePerl.pm b/lib/XML/SAX/PurePerl.pm new file mode 100644 index 0000000..7940736 --- /dev/null +++ b/lib/XML/SAX/PurePerl.pm @@ -0,0 +1,751 @@ +# $Id$ + +package XML::SAX::PurePerl; + +use strict; +use vars qw/$VERSION/; + +$VERSION = '1.02'; + +use XML::SAX::PurePerl::Productions qw($NameChar $SingleChar); +use XML::SAX::PurePerl::Reader; +use XML::SAX::PurePerl::EncodingDetect (); +use XML::SAX::Exception; +use XML::SAX::PurePerl::DocType (); +use XML::SAX::PurePerl::DTDDecls (); +use XML::SAX::PurePerl::XMLDecl (); +use XML::SAX::DocumentLocator (); +use XML::SAX::Base (); +use XML::SAX qw(Namespaces); +use XML::NamespaceSupport (); +use IO::File; + +if ($] < 5.006) { + require XML::SAX::PurePerl::NoUnicodeExt; +} +else { + require XML::SAX::PurePerl::UnicodeExt; +} + +use vars qw(@ISA); +@ISA = ('XML::SAX::Base'); + +my %int_ents = ( + amp => '&', + lt => '<', + gt => '>', + quot => '"', + apos => "'", + ); + +my $xmlns_ns = "http://www.w3.org/2000/xmlns/"; +my $xml_ns = "http://www.w3.org/XML/1998/namespace"; + +use Carp; +sub _parse_characterstream { + my $self = shift; + my ($fh) = @_; + confess("CharacterStream is not yet correctly 
implemented"); + my $reader = XML::SAX::PurePerl::Reader::Stream->new($fh); + return $self->_parse($reader); +} + +sub _parse_bytestream { + my $self = shift; + my ($fh) = @_; + my $reader = XML::SAX::PurePerl::Reader::Stream->new($fh); + return $self->_parse($reader); +} + +sub _parse_string { + my $self = shift; + my ($str) = @_; + my $reader = XML::SAX::PurePerl::Reader::String->new($str); + return $self->_parse($reader); +} + +sub _parse_systemid { + my $self = shift; + my ($uri) = @_; + my $reader = XML::SAX::PurePerl::Reader::URI->new($uri); + return $self->_parse($reader); +} + +sub _parse { + my ($self, $reader) = @_; + + $reader->public_id($self->{ParseOptions}{Source}{PublicId}); + $reader->system_id($self->{ParseOptions}{Source}{SystemId}); + + $self->{NSHelper} = XML::NamespaceSupport->new({xmlns => 1}); + + $self->set_document_locator( + XML::SAX::DocumentLocator->new( + sub { $reader->public_id }, + sub { $reader->system_id }, + sub { $reader->line }, + sub { $reader->column }, + sub { $reader->get_encoding }, + sub { $reader->get_xml_version }, + ), + ); + + $self->start_document({}); + + if (defined $self->{ParseOptions}{Source}{Encoding}) { + $reader->set_encoding($self->{ParseOptions}{Source}{Encoding}); + } + else { + $self->encoding_detect($reader); + } + + # parse a document + $self->document($reader); + + return $self->end_document({}); +} + +sub parser_error { + my $self = shift; + my ($error, $reader) = @_; + +# warn("parser error: $error from ", $reader->line, " : ", $reader->column, "\n"); + my $exception = XML::SAX::Exception::Parse->new( + Message => $error, + ColumnNumber => $reader->column, + LineNumber => $reader->line, + PublicId => $reader->public_id, + SystemId => $reader->system_id, + ); + + $self->fatal_error($exception); + $exception->throw; +} + +sub document { + my ($self, $reader) = @_; + + # document ::= prolog element Misc* + + $self->prolog($reader); + $self->element($reader) || + $self->parser_error("Document requires an 
element", $reader); + + while(length($reader->data)) { + $self->Misc($reader) || + $self->parser_error("Only Comments, PIs and whitespace allowed at end of document", $reader); + } +} + +sub prolog { + my ($self, $reader) = @_; + + $self->XMLDecl($reader); + + # consume all misc bits + 1 while($self->Misc($reader)); + + if ($self->doctypedecl($reader)) { + while (length($reader->data)) { + $self->Misc($reader) || last; + } + } +} + +sub element { + my ($self, $reader) = @_; + + return 0 unless $reader->match('<'); + + my $name = $self->Name($reader) || $self->parser_error("Invalid element name", $reader); + + my %attribs; + + while( my ($k, $v) = $self->Attribute($reader) ) { + $attribs{$k} = $v; + } + + my $have_namespaces = $self->get_feature(Namespaces); + + # Namespace processing + $self->{NSHelper}->push_context; + my @new_ns; +# my %attrs = @attribs; +# while (my ($k,$v) = each %attrs) { + if ($have_namespaces) { + while ( my ($k, $v) = each %attribs ) { + if ($k =~ m/^xmlns(:(.*))?$/) { + my $prefix = $2 || ''; + $self->{NSHelper}->declare_prefix($prefix, $v); + my $ns = + { + Prefix => $prefix, + NamespaceURI => $v, + }; + push @new_ns, $ns; + $self->SUPER::start_prefix_mapping($ns); + } + } + } + + # Create element object and fire event + my %attrib_hash; + while (my ($name, $value) = each %attribs ) { + # TODO normalise value here + my ($ns, $prefix, $lname); + if ($have_namespaces) { + ($ns, $prefix, $lname) = $self->{NSHelper}->process_attribute_name($name); + } + $ns ||= ''; $prefix ||= ''; $lname ||= ''; + $attrib_hash{"{$ns}$lname"} = { + Name => $name, + LocalName => $lname, + Prefix => $prefix, + NamespaceURI => $ns, + Value => $value, + }; + } + + %attribs = (); # lose the memory since we recurse deep + + my ($ns, $prefix, $lname); + if ($self->get_feature(Namespaces)) { + ($ns, $prefix, $lname) = $self->{NSHelper}->process_element_name($name); + } + else { + $lname = $name; + } + $ns ||= ''; $prefix ||= ''; $lname ||= ''; + + # Process remainder 
of start_element + $self->skip_whitespace($reader); + my $have_content; + my $data = $reader->data(2); + if ($data =~ /^\/>/) { + $reader->move_along(2); + } + else { + $data =~ /^>/ or $self->parser_error("No close element tag", $reader); + $reader->move_along(1); + $have_content++; + } + + my $el = + { + Name => $name, + LocalName => $lname, + Prefix => $prefix, + NamespaceURI => $ns, + Attributes => \%attrib_hash, + }; + $self->start_element($el); + + # warn("($name\n"); + + if ($have_content) { + $self->content($reader); + + my $data = $reader->data(2); + $data =~ /^<\// or $self->parser_error("No close tag marker", $reader); + $reader->move_along(2); + my $end_name = $self->Name($reader); + $end_name eq $name || $self->parser_error("End tag mismatch ($end_name != $name)", $reader); + $self->skip_whitespace($reader); + $reader->match('>') or $self->parser_error("No close '>' on end tag", $reader); + } + + my %end_el = %$el; + delete $end_el{Attributes}; + $self->end_element(\%end_el); + + for my $ns (@new_ns) { + $self->end_prefix_mapping($ns); + } + $self->{NSHelper}->pop_context; + + return 1; +} + +sub content { + my ($self, $reader) = @_; + + while (1) { + $self->CharData($reader); + + my $data = $reader->data(2); + + if ($data =~ /^<\//) { + return 1; + } + elsif ($data =~ /^&/) { + $self->Reference($reader) or $self->parser_error("bare & not allowed in content", $reader); + next; + } + elsif ($data =~ /^<!/) { + ($self->CDSect($reader) + or + $self->Comment($reader)) + and next; + } + elsif ($data =~ /^<\?/) { + $self->PI($reader) and next; + } + elsif ($data =~ /^</) { + $self->element($reader) and next; + } + last; + } + + return 1; +} + +sub CDSect { + my ($self, $reader) = @_; + + my $data = $reader->data(9); + return 0 unless $data =~ /^<!\[CDATA\[/; + $reader->move_along(9); + + $self->start_cdata({}); + + $data = $reader->data; + while (1) { + $self->parser_error("EOF looking for CDATA section end", $reader) + unless length($data); + + if ($data =~ /^(.*?)\]\]>/s) { + my $chars = $1; 
$reader->move_along(length($chars) + 3); + $self->characters({Data => $chars}); + last; + } + else { + $self->characters({Data => $data}); + $reader->move_along(length($data)); + $data = $reader->data; + } + } + $self->end_cdata({}); + return 1; +} + +sub CharData { + my ($self, $reader) = @_; + + my $data = $reader->data; + + while (1) { + return unless length($data); + + if ($data =~ /^([^<&]*)[<&]/s) { + my $chars = $1; + $self->parser_error("String ']]>' not allowed in character data", $reader) + if $chars =~ /\]\]>/; + $reader->move_along(length($chars)); + $self->characters({Data => $chars}) if length($chars); + last; + } + else { + $self->characters({Data => $data}); + $reader->move_along(length($data)); + $data = $reader->data; + } + } +} + +sub Misc { + my ($self, $reader) = @_; + if ($self->Comment($reader)) { + return 1; + } + elsif ($self->PI($reader)) { + return 1; + } + elsif ($self->skip_whitespace($reader)) { + return 1; + } + + return 0; +} + +sub Reference { + my ($self, $reader) = @_; + + return 0 unless $reader->match('&'); + + my $data = $reader->data; + + # Fetch more data if we have an incomplete numeric reference + if ($data =~ /^(#\d*|#x[0-9a-fA-F]*)$/) { + $data = $reader->data(length($data) + 6); + } + + if ($data =~ /^#x([0-9a-fA-F]+);/) { + my $ref = $1; + $reader->move_along(length($ref) + 3); + my $char = chr_ref(hex($ref)); + $self->parser_error("Character reference &#$ref; refers to an illegal XML character ($char)", $reader) + unless $char =~ /$SingleChar/o; + $self->characters({ Data => $char }); + return 1; + } + elsif ($data =~ /^#([0-9]+);/) { + my $ref = $1; + $reader->move_along(length($ref) + 2); + my $char = chr_ref($ref); + $self->parser_error("Character reference &#$ref; refers to an illegal XML character ($char)", $reader) + unless $char =~ /$SingleChar/o; + $self->characters({ Data => $char }); + return 1; + } + else { + # EntityRef + my $name = $self->Name($reader) + || $self->parser_error("Invalid name in entity", 
$reader); + $reader->match(';') or $self->parser_error("No semi-colon found after entity name", $reader); + + # warn("got entity: \&$name;\n"); + + # expand it + if ($self->_is_entity($name)) { + + if ($self->_is_external($name)) { + my $value = $self->_get_entity($name); + my $ent_reader = XML::SAX::PurePerl::Reader::URI->new($value); + $self->encoding_detect($ent_reader); + $self->extParsedEnt($ent_reader); + } + else { + my $value = $self->_stringify_entity($name); + my $ent_reader = XML::SAX::PurePerl::Reader::String->new($value); + $self->content($ent_reader); + } + return 1; + } + elsif ($name =~ /^(?:amp|gt|lt|quot|apos)$/) { + $self->characters({ Data => $int_ents{$name} }); + return 1; + } + else { + $self->parser_error("Undeclared entity", $reader); + } + } +} + +sub AttReference { + my ($self, $name, $reader) = @_; + if ($name =~ /^#x([0-9a-fA-F]+)$/) { + my $chr = chr_ref(hex($1)); + $chr =~ /$SingleChar/o or $self->parser_error("Character reference '&$name;' refers to an illegal XML character", $reader); + return $chr; + } + elsif ($name =~ /^#([0-9]+)$/) { + my $chr = chr_ref($1); + $chr =~ /$SingleChar/o or $self->parser_error("Character reference '&$name;' refers to an illegal XML character", $reader); + return $chr; + } + else { + if ($self->_is_entity($name)) { + if ($self->_is_external($name)) { + $self->parser_error("No external entity references allowed in attribute values", $reader); + } + else { + my $value = $self->_stringify_entity($name); + return $value; + } + } + elsif ($name =~ /^(?:amp|lt|gt|quot|apos)$/) { + return $int_ents{$name}; + } + else { + $self->parser_error("Undeclared entity '$name'", $reader); + } + } +} + +sub extParsedEnt { + my ($self, $reader) = @_; + + $self->TextDecl($reader); + $self->content($reader); +} + +sub _is_external { + my ($self, $name) = @_; +# TODO: Fix this to use $reader to store the entities perhaps. 
+ if ($self->{ParseOptions}{external_entities}{$name}) { + return 1; + } + return ; +} + +sub _is_entity { + my ($self, $name) = @_; +# TODO: ditto above + if (exists $self->{ParseOptions}{entities}{$name}) { + return 1; + } + return 0; +} + +sub _stringify_entity { + my ($self, $name) = @_; +# TODO: ditto above + if (exists $self->{ParseOptions}{expanded_entity}{$name}) { + return $self->{ParseOptions}{expanded_entity}{$name}; + } + # expand + my $reader = XML::SAX::PurePerl::Reader::URI->new($self->{ParseOptions}{entities}{$name}); + my $ent = ''; + while(1) { + my $data = $reader->data; + $ent .= $data; + $reader->move_along(length($data)) or last; + } + return $self->{ParseOptions}{expanded_entity}{$name} = $ent; +} + +sub _get_entity { + my ($self, $name) = @_; +# TODO: ditto above + return $self->{ParseOptions}{entities}{$name}; +} + +sub skip_whitespace { + my ($self, $reader) = @_; + + my $data = $reader->data; + + my $found = 0; + while ($data =~ s/^([\x20\x0A\x0D\x09]*)//) { + last unless length($1); + $found++; + $reader->move_along(length($1)); + $data = $reader->data; + } + + return $found; +} + +sub Attribute { + my ($self, $reader) = @_; + + $self->skip_whitespace($reader) || return; + + my $data = $reader->data(2); + return if $data =~ /^\/?>/; + + if (my $name = $self->Name($reader)) { + $self->skip_whitespace($reader); + $reader->match('=') or $self->parser_error("No '=' in Attribute", $reader); + $self->skip_whitespace($reader); + my $value = $self->AttValue($reader); + + if (!$self->cdata_attrib($name)) { + $value =~ s/^\x20*//; # discard leading spaces + $value =~ s/\x20*$//; # discard trailing spaces + $value =~ s/ {1,}/ /g; # all >1 space to single space + } + + return $name, $value; + } + + return; +} + +sub cdata_attrib { + # TODO implement this! 
+ return 1; +} + +sub AttValue { + my ($self, $reader) = @_; + + my $quote = $self->quote($reader); + + my $value = ''; + + while (1) { + my $data = $reader->data; + $self->parser_error("EOF found while looking for the end of attribute value", $reader) + unless length($data); + if ($data =~ /^([^$quote]*)$quote/) { + $reader->move_along(length($1) + 1); + $value .= $1; + last; + } + else { + $value .= $data; + $reader->move_along(length($data)); + } + } + + if ($value =~ /</) { + $self->parser_error("< character not allowed in attribute values", $reader); + } + + $value =~ s/[\x09\x0A\x0D]/\x20/g; + $value =~ s/&(#(x[0-9a-fA-F]+)|#([0-9]+)|\w+);/$self->AttReference($1, $reader)/geo; + + return $value; +} + +sub Comment { + my ($self, $reader) = @_; + + my $data = $reader->data(4); + if ($data =~ /^<!--/) { + $reader->move_along(4); + my $comment_str = ''; + while (1) { + my $data = $reader->data; + $self->parser_error("End of data seen while looking for close comment marker", $reader) + unless length($data); + if ($data =~ /^(.*?)-->/s) { + $comment_str .= $1; + $self->parser_error("Invalid comment (dash)", $reader) if $comment_str =~ /-$/; + $reader->move_along(length($1) + 3); + last; + } + else { + $comment_str .= $data; + $reader->move_along(length($data)); + } + } + + $self->comment({ Data => $comment_str }); + + return 1; + } + return 0; +} + +sub PI { + my ($self, $reader) = @_; + + my $data = $reader->data(2); + + if ($data =~ /^<\?/) { + $reader->move_along(2); + my ($target); + $target = $self->Name($reader) || + $self->parser_error("PI has no target", $reader); + + my $pi_data = ''; + if ($self->skip_whitespace($reader)) { + while (1) { + my $data = $reader->data; + $self->parser_error("End of data seen while looking for close PI marker", $reader) + unless length($data); + if ($data =~ /^(.*?)\?>/s) { + $pi_data .= $1; + $reader->move_along(length($1) + 2); + last; + } + else { + $pi_data .= $data; + $reader->move_along(length($data)); + } + } + } + else { + my $data = $reader->data(2); + $data =~ /^\?>/ or $self->parser_error("PI closing sequence not found", $reader); + $reader->move_along(2); + } + + $self->processing_instruction({ Target => $target, Data => $pi_data }); + + return 1; + } + return 0; 
+} + +sub Name { + my ($self, $reader) = @_; + + my $name = ''; + while(1) { + my $data = $reader->data; + return unless length($data); + $data =~ /^([^\s>\/&\?;=<\)\(\[\],\%\#\!\*\|]*)/ or return; + $name .= $1; + my $len = length($1); + $reader->move_along($len); + last if ($len != length($data)); + } + + return unless length($name); + + $name =~ /$NameChar/o or $self->parser_error("Name <$name> does not match NameChar production", $reader); + + return $name; +} + +sub quote { + my ($self, $reader) = @_; + + my $data = $reader->data; + + $data =~ /^(['"])/ or $self->parser_error("Invalid quote token", $reader); + $reader->move_along(1); + return $1; +} + +1; +__END__ + +=head1 NAME + +XML::SAX::PurePerl - Pure Perl XML Parser with SAX2 interface + +=head1 SYNOPSIS + + use XML::Handler::Foo; + use XML::SAX::PurePerl; + my $handler = XML::Handler::Foo->new(); + my $parser = XML::SAX::PurePerl->new(Handler => $handler); + $parser->parse_uri("myfile.xml"); + +=head1 DESCRIPTION + +This module implements an XML parser in pure perl. It is written around the +upcoming perl 5.8's unicode support and support for multiple document +encodings (using the PerlIO layer), however it has been ported to work with +ASCII/UTF8 documents under lower perl versions. + +The SAX2 API is described in detail at http://sourceforge.net/projects/perl-xml/, in +the CVS archive, under libxml-perl/docs. Hopefully those documents will be in a +better location soon. + +Please refer to the SAX2 documentation for how to use this module - it is merely a +front end to SAX2, and implements nothing that is not in that spec (or at least tries +not to - please email me if you find errors in this implementation). + +=head1 BUGS + +XML::SAX::PurePerl is B<slow>. Very slow. I suggest you use something else +in fact. However it is great as a fallback parser for XML::SAX, where the +user might not be able to install an XS based parser or C library. + +Currently lots, probably. 
At the moment the weakest area is parsing DOCTYPE declarations, +though the code is in place to start doing this. Also parsing parameter entity +references is causing me much confusion, since it's not exactly what I would call +trivial, or well documented in the XML grammar. XML documents with internal subsets +are likely to fail. + +I am however trying to work towards full conformance using the Oasis test suite. + +=head1 AUTHOR + +Matt Sergeant, matt@sergeant.org. Copyright 2001. + +Please report all bugs to the Perl-XML mailing list at perl-xml@listserv.activestate.com. + +=head1 LICENSE + +This is free software. You may use it or redistribute it under the same terms as +Perl 5.7.2 itself. + +=cut + diff --git a/lib/XML/SAX/PurePerl/DTDDecls.pm b/lib/XML/SAX/PurePerl/DTDDecls.pm new file mode 100644 index 0000000..eaa7825 --- /dev/null +++ b/lib/XML/SAX/PurePerl/DTDDecls.pm @@ -0,0 +1,603 @@ +# $Id$ + +package XML::SAX::PurePerl; + +use strict; +use XML::SAX::PurePerl::Productions qw($SingleChar); + +sub elementdecl { + my ($self, $reader) = @_; + + my $data = $reader->data(9); + return 0 unless $data =~ /^<!ELEMENT/; + $reader->move_along(9); + + $self->skip_whitespace($reader) || + $self->parser_error("No whitespace after ELEMENT declaration", $reader); + + my $name = $self->Name($reader); + + $self->skip_whitespace($reader) || + $self->parser_error("No whitespace after ELEMENT's name", $reader); + + $self->contentspec($reader, $name); + + $self->skip_whitespace($reader); + + $reader->match('>') or $self->parser_error("Closing angle bracket not found on ELEMENT declaration", $reader); + + return 1; +} + +sub contentspec { + my ($self, $reader, $name) = @_; + + my $data = $reader->data(5); + + my $model; + if ($data =~ /^EMPTY/) { + $reader->move_along(5); + $model = 'EMPTY'; + } + elsif ($data =~ /^ANY/) { + $reader->move_along(3); + $model = 'ANY'; + } + else { + $model = $self->Mixed_or_children($reader); + } + + if ($model) { + # call SAX callback now. 
+ $self->element_decl({Name => $name, Model => $model}); + return 1; + } + + $self->parser_error("contentspec not found in ELEMENT declaration", $reader); +} + +sub Mixed_or_children { + my ($self, $reader) = @_; + + my $data = $reader->data(8); + $data =~ /^\(/ or return; # $self->parser_error("No opening bracket in Mixed or children", $reader); + + if ($data =~ /^\(\s*\#PCDATA/) { + $reader->match('('); + $self->skip_whitespace($reader); + $reader->move_along(7); + my $model = $self->Mixed($reader); + return $model; + } + + # not matched - must be Children + return $self->children($reader); +} + +# Mixed ::= ( '(' S* PCDATA ( S* '|' S* QName )* S* ')' '*' ) +# | ( '(' S* PCDATA S* ')' ) +sub Mixed { + my ($self, $reader) = @_; + + # Mixed_or_children already matched '(' S* '#PCDATA' + + my $model = '(#PCDATA'; + + $self->skip_whitespace($reader); + + my %seen; + + while (1) { + last unless $reader->match('|'); + $self->skip_whitespace($reader); + + my $name = $self->Name($reader) || + $self->parser_error("No 'Name' after Mixed content '|'", $reader); + + if ($seen{$name}) { + $self->parser_error("Element '$name' has already appeared in this group", $reader); + } + $seen{$name}++; + + $model .= "|$name"; + + $self->skip_whitespace($reader); + } + + $reader->match(')') || $self->parser_error("no closing bracket on mixed content", $reader); + + $model .= ")"; + + if ($reader->match('*')) { + $model .= "*"; + } + + return $model; +} + +# [[47]] Children ::= ChoiceOrSeq Cardinality? +# [[48]] Cp ::= ( QName | ChoiceOrSeq ) Cardinality? +# ChoiceOrSeq ::= '(' S* Cp ( Choice | Seq )? S* ')' +# [[49]] Choice ::= ( S* '|' S* Cp )+ +# [[50]] Seq ::= ( S* ',' S* Cp )+ +# // Children ::= (Choice | Seq) Cardinality? +# // Cp ::= ( QName | Choice | Seq) Cardinality? 
+# // Choice ::= '(' S* Cp ( S* '|' S* Cp )+ S* ')' +# // Seq ::= '(' S* Cp ( S* ',' S* Cp )* S* ')' +# [[51]] Mixed ::= ( '(' S* PCDATA ( S* '|' S* QName )* S* ')' MixedCardinality ) +# | ( '(' S* PCDATA S* ')' ) +# Cardinality ::= '?' | '+' | '*' +# MixedCardinality ::= '*' +sub children { + my ($self, $reader) = @_; + + return $self->ChoiceOrSeq($reader) . $self->Cardinality($reader); +} + +sub ChoiceOrSeq { + my ($self, $reader) = @_; + + $reader->match('(') or $self->parser_error("choice/seq contains no opening bracket", $reader); + + my $model = '('; + + $self->skip_whitespace($reader); + + $model .= $self->Cp($reader); + + if (my $choice = $self->Choice($reader)) { + $model .= $choice; + } + else { + $model .= $self->Seq($reader); + } + + $self->skip_whitespace($reader); + + $reader->match(')') or $self->parser_error("choice/seq contains no closing bracket", $reader); + + $model .= ')'; + + return $model; +} + +sub Cardinality { + my ($self, $reader) = @_; + # cardinality is always optional + my $data = $reader->data; + if ($data =~ /^([\?\+\*])/) { + $reader->move_along(1); + return $1; + } + return ''; +} + +sub Cp { + my ($self, $reader) = @_; + + my $model; + my $name = eval + { + if (my $name = $self->Name($reader)) { + return $name . $self->Cardinality($reader); + } + }; + return $name if defined $name; + return $self->ChoiceOrSeq($reader) . 
$self->Cardinality($reader); +} + +sub Choice { + my ($self, $reader) = @_; + + my $model = ''; + $self->skip_whitespace($reader); + + while ($reader->match('|')) { + $self->skip_whitespace($reader); + $model .= '|'; + $model .= $self->Cp($reader); + $self->skip_whitespace($reader); + } + + return $model; +} + +sub Seq { + my ($self, $reader) = @_; + + my $model = ''; + $self->skip_whitespace($reader); + + while ($reader->match(',')) { + $self->skip_whitespace($reader); + my $cp = $self->Cp($reader); + if ($cp) { + $model .= ','; + $model .= $cp; + } + $self->skip_whitespace($reader); + } + + return $model; +} + +sub AttlistDecl { + my ($self, $reader) = @_; + + my $data = $reader->data(9); + if ($data =~ /^<!ATTLIST/) { + $reader->move_along(9); + + $self->skip_whitespace($reader) || + $self->parser_error("No whitespace after ATTLIST declaration", $reader); + my $name = $self->Name($reader); + + $self->AttDefList($reader, $name); + + $self->skip_whitespace($reader); + + $reader->match('>') or $self->parser_error("Closing angle bracket not found on ATTLIST declaration", $reader); + + return 1; + } + + return 0; +} + +sub AttDefList { + my ($self, $reader, $name) = @_; + + 1 while $self->AttDef($reader, $name); +} + +sub AttDef { + my ($self, $reader, $el_name) = @_; + + $self->skip_whitespace($reader) || return 0; + my $att_name = $self->Name($reader) || return 0; + $self->skip_whitespace($reader) || + $self->parser_error("No whitespace after Name in attribute definition", $reader); + my $att_type = $self->AttType($reader); + + $self->skip_whitespace($reader) || + $self->parser_error("No whitespace after AttType in attribute definition", $reader); + my ($mode, $value) = $self->DefaultDecl($reader); + + # fire SAX event here! 
+ $self->attribute_decl({ + eName => $el_name, + aName => $att_name, + Type => $att_type, + Mode => $mode, + Value => $value, + }); + return 1; +} + +sub AttType { + my ($self, $reader) = @_; + + return $self->StringType($reader) || + $self->TokenizedType($reader) || + $self->EnumeratedType($reader) || + $self->parser_error("Can't match AttType", $reader); +} + +sub StringType { + my ($self, $reader) = @_; + + my $data = $reader->data(5); + return unless $data =~ /^CDATA/; + $reader->move_along(5); + return 'CDATA'; +} + +sub TokenizedType { + my ($self, $reader) = @_; + + my $data = $reader->data(8); + if ($data =~ /^(IDREFS?|ID|ENTITIES|ENTITY|NMTOKENS?)/) { + $reader->move_along(length($1)); + return $1; + } + return; +} + +sub EnumeratedType { + my ($self, $reader) = @_; + return $self->NotationType($reader) || $self->Enumeration($reader); +} + +sub NotationType { + my ($self, $reader) = @_; + + my $data = $reader->data(8); + return unless $data =~ /^NOTATION/; + $reader->move_along(8); + + $self->skip_whitespace($reader) || + $self->parser_error("No whitespace after NOTATION", $reader); + $reader->match('(') or $self->parser_error("No opening bracket in notation section", $reader); + + $self->skip_whitespace($reader); + my $model = 'NOTATION ('; + my $name = $self->Name($reader) || + $self->parser_error("No name in notation section", $reader); + $model .= $name; + $self->skip_whitespace($reader); + $data = $reader->data; + while ($data =~ /^\|/) { + $reader->move_along(1); + $model .= '|'; + $self->skip_whitespace($reader); + my $name = $self->Name($reader) || + $self->parser_error("No name in notation section", $reader); + $model .= $name; + $self->skip_whitespace($reader); + $data = $reader->data; + } + $data =~ /^\)/ or $self->parser_error("No closing bracket in notation section", $reader); + $reader->move_along(1); + + $model .= ')'; + + return $model; +} + +sub Enumeration { + my ($self, $reader) = @_; + + return unless $reader->match('('); + + 
$self->skip_whitespace($reader); + my $model = '('; + my $nmtoken = $self->Nmtoken($reader) || + $self->parser_error("No Nmtoken in enumerated declaration", $reader); + $model .= $nmtoken; + $self->skip_whitespace($reader); + my $data = $reader->data; + while ($data =~ /^\|/) { + $model .= '|'; + $reader->move_along(1); + $self->skip_whitespace($reader); + my $nmtoken = $self->Nmtoken($reader) || + $self->parser_error("No Nmtoken in enumerated declaration", $reader); + $model .= $nmtoken; + $self->skip_whitespace($reader); + $data = $reader->data; + } + $data =~ /^\)/ or $self->parser_error("No closing bracket in enumerated declaration", $reader); + $reader->move_along(1); + + $model .= ')'; + + return $model; +} + +sub Nmtoken { + my ($self, $reader) = @_; + return $self->Name($reader); +} + +sub DefaultDecl { + my ($self, $reader) = @_; + + my $data = $reader->data(9); + if ($data =~ /^(\#REQUIRED|\#IMPLIED)/) { + $reader->move_along(length($1)); + return $1; + } + my $model = ''; + if ($data =~ /^\#FIXED/) { + $reader->move_along(6); + $self->skip_whitespace($reader) || $self->parser_error( + "no whitespace after FIXED specifier", $reader); + my $value = $self->AttValue($reader); + return "#FIXED", $value; + } + my $value = $self->AttValue($reader); + return undef, $value; +} + +sub EntityDecl { + my ($self, $reader) = @_; + + my $data = $reader->data(8); + return 0 unless $data =~ /^<!ENTITY/; + $reader->move_along(8); + + $self->skip_whitespace($reader) || $self->parser_error( + "No whitespace after ENTITY declaration", $reader); + + $self->PEDecl($reader) || $self->GEDecl($reader); + + $self->skip_whitespace($reader); + + $reader->match('>') or $self->parser_error("No closing '>' in entity definition", $reader); + + return 1; +} + +sub GEDecl { + my ($self, $reader) = @_; + + my $name = $self->Name($reader) || $self->parser_error("No entity name given", $reader); + $self->skip_whitespace($reader) || $self->parser_error("No whitespace after entity name", $reader); + + # TODO: 
ExternalID calls lexhandler method. Wrong place for it. + my $value; + if ($value = $self->ExternalID($reader)) { + $value .= $self->NDataDecl($reader); + } + else { + $value = $self->EntityValue($reader); + } + + if ($self->{ParseOptions}{entities}{$name}) { + warn("entity $name already exists\n"); + } else { + $self->{ParseOptions}{entities}{$name} = 1; + $self->{ParseOptions}{expanded_entity}{$name} = $value; # ??? + } + # do callback? + return 1; +} + +sub PEDecl { + my ($self, $reader) = @_; + + return 0 unless $reader->match('%'); + + $self->skip_whitespace($reader) || $self->parser_error("No whitespace after parameter entity marker", $reader); + my $name = $self->Name($reader) || $self->parser_error("No parameter entity name given", $reader); + $self->skip_whitespace($reader) || $self->parser_error("No whitespace after parameter entity name", $reader); + my $value = $self->ExternalID($reader) || + $self->EntityValue($reader) || + $self->parser_error("PE is not a value or an external resource", $reader); + # do callback? 
+ return 1; +} + +my $quotre = qr/[^%&\"]/; +my $aposre = qr/[^%&\']/; + +sub EntityValue { + my ($self, $reader) = @_; + + my $data = $reader->data; + my $quote = '"'; + my $re = $quotre; + if ($data !~ /^"/) { + $data =~ /^'/ or $self->parser_error("Not a quote character", $reader); + $quote = "'"; + $re = $aposre; + } + $reader->move_along(1); + + my $value = ''; + + while (1) { + my $data = $reader->data; + + $self->parser_error("EOF found while reading entity value", $reader) + unless length($data); + + if ($data =~ /^($re+)/) { + my $match = $1; + $value .= $match; + $reader->move_along(length($match)); + } + elsif ($reader->match('&')) { + # if it's a char ref, expand now: + if ($reader->match('#')) { + my $char; + my $ref = ''; + if ($reader->match('x')) { + my $data = $reader->data; + while (1) { + $self->parser_error("EOF looking for reference end", $reader) + unless length($data); + if ($data !~ /^([0-9a-fA-F]*)/) { + last; + } + $ref .= $1; + $reader->move_along(length($1)); + if (length($1) == length($data)) { + $data = $reader->data; + } + else { + last; + } + } + $char = chr_ref(hex($ref)); + $ref = "x$ref"; + } + else { + my $data = $reader->data; + while (1) { + $self->parser_error("EOF looking for reference end", $reader) + unless length($data); + if ($data !~ /^([0-9]*)/) { + last; + } + $ref .= $1; + $reader->move_along(length($1)); + if (length($1) == length($data)) { + $data = $reader->data; + } + else { + last; + } + } + $char = chr($ref); + } + $reader->match(';') || + $self->parser_error("No semi-colon found after character reference", $reader); + if ($char !~ $SingleChar) { # match a single character + $self->parser_error("Character reference '&#$ref;' refers to an illegal XML character ($char)", $reader); + } + $value .= $char; + } + else { + # entity refs in entities get expanded later, so don't parse now. 
+ $value .= '&'; + } + } + elsif ($reader->match('%')) { + $value .= $self->PEReference($reader); + } + elsif ($reader->match($quote)) { + # end of attrib + last; + } + else { + $self->parser_error("Invalid character in attribute value: " . substr($reader->data, 0, 1), $reader); + } + } + + return $value; +} + +sub NDataDecl { + my ($self, $reader) = @_; + $self->skip_whitespace($reader) || return ''; + my $data = $reader->data(5); + return '' unless $data =~ /^NDATA/; + $reader->move_along(5); + $self->skip_whitespace($reader) || $self->parser_error("No whitespace after NDATA declaration", $reader); + my $name = $self->Name($reader) || $self->parser_error("NDATA declaration lacks a proper Name", $reader); + return " NDATA $name"; +} + +sub NotationDecl { + my ($self, $reader) = @_; + + my $data = $reader->data(10); + return 0 unless $data =~ /^<!NOTATION/; + $reader->move_along(10); + $self->skip_whitespace($reader) || + $self->parser_error("No whitespace after NOTATION declaration", $reader); + $data = $reader->data; + my $value = ''; + while(1) { + $self->parser_error("EOF found while looking for end of NotationDecl", $reader) + unless length($data); + + if ($data =~ /^([^>]*)>/) { + $value .= $1; + $reader->move_along(length($1) + 1); + $self->notation_decl({Name => "FIXME", SystemId => "FIXME", PublicId => "FIXME" }); + last; + } + else { + $value .= $data; + $reader->move_along(length($data)); + $data = $reader->data; + } + } + return 1; +} + +1; diff --git a/lib/XML/SAX/PurePerl/DebugHandler.pm b/lib/XML/SAX/PurePerl/DebugHandler.pm new file mode 100644 index 0000000..b80d060 --- /dev/null +++ b/lib/XML/SAX/PurePerl/DebugHandler.pm @@ -0,0 +1,95 @@ +# $Id$ + +package XML::SAX::PurePerl::DebugHandler; + +use strict; + +sub new { + my $class = shift; + my %opts = @_; + return bless \%opts, $class; +} + +# DocumentHandler + +sub set_document_locator { + my $self = shift; + print "set_document_locator\n" if $ENV{DEBUG_XML}; + $self->{seen}{set_document_locator}++; +} + +sub 
start_document { + my $self = shift; + print "start_document\n" if $ENV{DEBUG_XML}; + $self->{seen}{start_document}++; +} + +sub end_document { + my $self = shift; + print "end_document\n" if $ENV{DEBUG_XML}; + $self->{seen}{end_document}++; +} + +sub start_element { + my $self = shift; + print "start_element\n" if $ENV{DEBUG_XML}; + $self->{seen}{start_element}++; +} + +sub end_element { + my $self = shift; + print "end_element\n" if $ENV{DEBUG_XML}; + $self->{seen}{end_element}++; +} + +sub characters { + my $self = shift; + print "characters\n" if $ENV{DEBUG_XML}; +# warn "Char: ", $_[0]->{Data}, "\n"; + $self->{seen}{characters}++; +} + +sub processing_instruction { + my $self = shift; + print "processing_instruction\n" if $ENV{DEBUG_XML}; + $self->{seen}{processing_instruction}++; +} + +sub ignorable_whitespace { + my $self = shift; + print "ignorable_whitespace\n" if $ENV{DEBUG_XML}; + $self->{seen}{ignorable_whitespace}++; +} + +# LexHandler + +sub comment { + my $self = shift; + print "comment\n" if $ENV{DEBUG_XML}; + $self->{seen}{comment}++; +} + +# DTDHandler + +sub notation_decl { + my $self = shift; + print "notation_decl\n" if $ENV{DEBUG_XML}; + $self->{seen}{notation_decl}++; +} + +sub unparsed_entity_decl { + my $self = shift; + print "unparsed_entity_decl\n" if $ENV{DEBUG_XML}; + $self->{seen}{entity_decl}++; +} + +# EntityResolver + +sub resolve_entity { + my $self = shift; + print "resolve_entity\n" if $ENV{DEBUG_XML}; + $self->{seen}{resolve_entity}++; + return ''; +} + +1; diff --git a/lib/XML/SAX/PurePerl/DocType.pm b/lib/XML/SAX/PurePerl/DocType.pm new file mode 100644 index 0000000..71b77c2 --- /dev/null +++ b/lib/XML/SAX/PurePerl/DocType.pm @@ -0,0 +1,180 @@ +# $Id$ + +package XML::SAX::PurePerl; + +use strict; +use XML::SAX::PurePerl::Productions qw($PubidChar); + +sub doctypedecl { + my ($self, $reader) = @_; + + my $data = $reader->data(9); + if ($data =~ /^<!DOCTYPE/) { + $reader->move_along(9); + $self->skip_whitespace($reader) || + $self->parser_error("No 
whitespace after doctype declaration", $reader); + + my $root_name = $self->Name($reader) || + $self->parser_error("Doctype declaration has no root element name", $reader); + + if ($self->skip_whitespace($reader)) { + # might be externalid... + my %dtd = $self->ExternalID($reader); + # TODO: Call SAX event + } + + $self->skip_whitespace($reader); + + $self->InternalSubset($reader); + + $reader->match('>') or $self->parser_error("Doctype not closed", $reader); + + return 1; + } + + return 0; +} + +sub ExternalID { + my ($self, $reader) = @_; + + my $data = $reader->data(6); + + if ($data =~ /^SYSTEM/) { + $reader->move_along(6); + $self->skip_whitespace($reader) || + $self->parser_error("No whitespace after SYSTEM identifier", $reader); + return (SYSTEM => $self->SystemLiteral($reader)); + } + elsif ($data =~ /^PUBLIC/) { + $reader->move_along(6); + $self->skip_whitespace($reader) || + $self->parser_error("No whitespace after PUBLIC identifier", $reader); + + my $quote = $self->quote($reader) || + $self->parser_error("Not a quote character in PUBLIC identifier", $reader); + + my $data = $reader->data; + my $pubid = ''; + while(1) { + $self->parser_error("EOF while looking for end of PUBLIC identifiier", $reader) + unless length($data); + + if ($data =~ /^([^$quote]*)$quote/) { + $pubid .= $1; + $reader->move_along(length($1) + 1); + last; + } + else { + $pubid .= $data; + $reader->move_along(length($data)); + $data = $reader->data; + } + } + + if ($pubid !~ /^($PubidChar)+$/) { + $self->parser_error("Invalid characters in PUBLIC identifier", $reader); + } + + $self->skip_whitespace($reader) || + $self->parser_error("Not whitespace after PUBLIC ID in DOCTYPE", $reader); + + return (PUBLIC => $pubid, + SYSTEM => $self->SystemLiteral($reader)); + } + else { + return; + } + + return 1; +} + +sub SystemLiteral { + my ($self, $reader) = @_; + + my $quote = $self->quote($reader); + + my $data = $reader->data; + my $systemid = ''; + while (1) { + $self->parser_error("EOF 
found while looking for end of System Literal", $reader) + unless length($data); + if ($data =~ /^([^$quote]*)$quote/) { + $systemid .= $1; + $reader->move_along(length($1) + 1); + return $systemid; + } + else { + $systemid .= $data; + $reader->move_along(length($data)); + $data = $reader->data; + } + } +} + +sub InternalSubset { + my ($self, $reader) = @_; + + return 0 unless $reader->match('['); + + 1 while $self->IntSubsetDecl($reader); + + $reader->match(']') or $self->parser_error("No close bracket on internal subset (found: " . $reader->data, $reader); + $self->skip_whitespace($reader); + return 1; +} + +sub IntSubsetDecl { + my ($self, $reader) = @_; + + return $self->DeclSep($reader) || $self->markupdecl($reader); +} + +sub DeclSep { + my ($self, $reader) = @_; + + if ($self->skip_whitespace($reader)) { + return 1; + } + + if ($self->PEReference($reader)) { + return 1; + } + +# if ($self->ParsedExtSubset($reader)) { +# return 1; +# } + + return 0; +} + +sub PEReference { + my ($self, $reader) = @_; + + return 0 unless $reader->match('%'); + + my $peref = $self->Name($reader) || + $self->parser_error("PEReference did not find a Name", $reader); + # TODO - load/parse the peref + + $reader->match(';') or $self->parser_error("Invalid token in PEReference", $reader); + return 1; +} + +sub markupdecl { + my ($self, $reader) = @_; + + if ($self->elementdecl($reader) || + $self->AttlistDecl($reader) || + $self->EntityDecl($reader) || + $self->NotationDecl($reader) || + $self->PI($reader) || + $self->Comment($reader)) + { + return 1; + } + + return 0; +} + +1; diff --git a/lib/XML/SAX/PurePerl/EncodingDetect.pm b/lib/XML/SAX/PurePerl/EncodingDetect.pm new file mode 100644 index 0000000..7ab4548 --- /dev/null +++ b/lib/XML/SAX/PurePerl/EncodingDetect.pm @@ -0,0 +1,105 @@ +# $Id$ + +package XML::SAX::PurePerl; # NB, not ::EncodingDetect! 
+ +use strict; + +sub encoding_detect { + my ($parser, $reader) = @_; + + my $error = "Invalid byte sequence at start of file"; + + my $data = $reader->data; + if ($data =~ /^\x00\x00\xFE\xFF/) { + # BO-UCS4-be + $reader->move_along(4); + $reader->set_encoding('UCS-4BE'); + return; + } + elsif ($data =~ /^\x00\x00\xFF\xFE/) { + # BO-UCS-4-2143 + $reader->move_along(4); + $reader->set_encoding('UCS-4-2143'); + return; + } + elsif ($data =~ /^\x00\x00\x00\x3C/) { + $reader->set_encoding('UCS-4BE'); + return; + } + elsif ($data =~ /^\x00\x00\x3C\x00/) { + $reader->set_encoding('UCS-4-2143'); + return; + } + elsif ($data =~ /^\x00\x3C\x00\x00/) { + $reader->set_encoding('UCS-4-3412'); + return; + } + elsif ($data =~ /^\x00\x3C\x00\x3F/) { + $reader->set_encoding('UTF-16BE'); + return; + } + elsif ($data =~ /^\xFF\xFE\x00\x00/) { + # BO-UCS-4LE + $reader->move_along(4); + $reader->set_encoding('UCS-4LE'); + return; + } + elsif ($data =~ /^\xFF\xFE/) { + $reader->move_along(2); + $reader->set_encoding('UTF-16LE'); + return; + } + elsif ($data =~ /^\xFE\xFF\x00\x00/) { + $reader->move_along(4); + $reader->set_encoding('UCS-4-3412'); + return; + } + elsif ($data =~ /^\xFE\xFF/) { + $reader->move_along(2); + $reader->set_encoding('UTF-16BE'); + return; + } + elsif ($data =~ /^\xEF\xBB\xBF/) { # UTF-8 BOM + $reader->move_along(3); + $reader->set_encoding('UTF-8'); + return; + } + elsif ($data =~ /^\x3C\x00\x00\x00/) { + $reader->set_encoding('UCS-4LE'); + return; + } + elsif ($data =~ /^\x3C\x00\x3F\x00/) { + $reader->set_encoding('UTF-16LE'); + return; + } + elsif ($data =~ /^\x3C\x3F\x78\x6D/) { + # $reader->set_encoding('UTF-8'); + return; + } + elsif ($data =~ /^\x3C\x3F\x78/) { + # $reader->set_encoding('UTF-8'); + return; + } + elsif ($data =~ /^\x3C\x3F/) { + # $reader->set_encoding('UTF-8'); + return; + } + elsif ($data =~ /^\x3C/) { + # $reader->set_encoding('UTF-8'); + return; + } + elsif ($data =~ /^[\x20\x09\x0A\x0D]+\x3C[^\x3F]/) { + # 
$reader->set_encoding('UTF-8'); + return; + } + elsif ($data =~ /^\x4C\x6F\xA7\x94/) { + $reader->set_encoding('EBCDIC'); + return; + } + + warn("Unable to recognise encoding of this document"); + return; +} + +1; + diff --git a/lib/XML/SAX/PurePerl/Exception.pm b/lib/XML/SAX/PurePerl/Exception.pm new file mode 100644 index 0000000..1ade99b --- /dev/null +++ b/lib/XML/SAX/PurePerl/Exception.pm @@ -0,0 +1,67 @@ +# $Id$ + +package XML::SAX::PurePerl::Exception; + +use strict; + +use overload '""' => "stringify"; + +use vars qw/$StackTrace/; + +$StackTrace = $ENV{XML_DEBUG} || 0; + +sub throw { + my $class = shift; + die $class->new(@_); +} + +sub new { + my $class = shift; + my %opts = @_; + die "Invalid options" unless exists $opts{Message}; + + if ($opts{reader}) { + return bless { Message => $opts{Message}, + Exception => undef, # not sure what this is for!!! + ColumnNumber => $opts{reader}->column, + LineNumber => $opts{reader}->line, + PublicId => $opts{reader}->public_id, + SystemId => $opts{reader}->system_id, + $StackTrace ? (StackTrace => stacktrace()) : (), + }, $class; + } + return bless { Message => $opts{Message}, + Exception => undef, # not sure what this is for!!! + }, $class; +} + +sub stringify { + my $self = shift; + local $^W; + return $self->{Message} . " [Ln: " . $self->{LineNumber} . + ", Col: " . $self->{ColumnNumber} . "]" . + ($StackTrace ? stackstring($self->{StackTrace}) : "") . "\n"; +} + +sub stacktrace { + my $i = 2; + my @fulltrace; + while (my @trace = caller($i++)) { + my %hash; + @hash{qw(Package Filename Line)} = @trace[0..2]; + push @fulltrace, \%hash; + } + return \@fulltrace; +} + +sub stackstring { + my $stacktrace = shift; + my $string = "\nFrom:\n"; + foreach my $current (@$stacktrace) { + $string .= $current->{Filename} . " Line: " . $current->{Line} . 
"\n"; + } + return $string; +} + +1; + diff --git a/lib/XML/SAX/PurePerl/NoUnicodeExt.pm b/lib/XML/SAX/PurePerl/NoUnicodeExt.pm new file mode 100644 index 0000000..70d2071 --- /dev/null +++ b/lib/XML/SAX/PurePerl/NoUnicodeExt.pm @@ -0,0 +1,28 @@ +# $Id$ + +package XML::SAX::PurePerl; +use strict; + +sub chr_ref { + my $n = shift; + if ($n < 0x80) { + return chr ($n); + } + elsif ($n < 0x800) { + return pack ("CC", (($n >> 6) | 0xc0), (($n & 0x3f) | 0x80)); + } + elsif ($n < 0x10000) { + return pack ("CCC", (($n >> 12) | 0xe0), ((($n >> 6) & 0x3f) | 0x80), + (($n & 0x3f) | 0x80)); + } + elsif ($n < 0x110000) + { + return pack ("CCCC", (($n >> 18) | 0xf0), ((($n >> 12) & 0x3f) | 0x80), + ((($n >> 6) & 0x3f) | 0x80), (($n & 0x3f) | 0x80)); + } + else { + return undef; + } +} + +1; diff --git a/lib/XML/SAX/PurePerl/Productions.pm b/lib/XML/SAX/PurePerl/Productions.pm new file mode 100644 index 0000000..25d49fb --- /dev/null +++ b/lib/XML/SAX/PurePerl/Productions.pm @@ -0,0 +1,147 @@ +# $Id$ + +package XML::SAX::PurePerl::Productions; + +use Exporter; +@ISA = ('Exporter'); +@EXPORT_OK = qw($S $Char $VersionNum $BaseChar $Ideographic + $Extender $Digit $CombiningChar $EncNameStart $EncNameEnd $NameChar $CharMinusDash + $PubidChar $Any $SingleChar); + +### WARNING!!! All productions here must *only* match a *single* character!!! ### + +BEGIN { +$S = qr/[\x20\x09\x0D\x0A]/; + +$CharMinusDash = qr/[^-]/x; + +$Any = qr/ . 
/xms; + +$VersionNum = qr/ [a-zA-Z0-9_.:-]+ /x; + +$EncNameStart = qr/ [A-Za-z] /x; +$EncNameEnd = qr/ [A-Za-z0-9\._-] /x; + +$PubidChar = qr/ [\x20\x0D\x0Aa-zA-Z0-9'()\+,.\/:=\?;!*\#@\$_\%-] /x; + +if ($] < 5.006) { + eval <<' PERL'; + $Char = qr/^ [\x09\x0A\x0D\x20-\x7F]|([\xC0-\xFD][\x80-\xBF]+) $/x; + + $SingleChar = qr/^$Char$/; + + $BaseChar = qr/ [\x41-\x5A\x61-\x7A]|([\xC0-\xFD][\x80-\xBF]+) /x; + + $Extender = qr/ \xB7 /x; + + $Digit = qr/ [\x30-\x39] /x; + + # can't do this one without unicode + # $CombiningChar = qr/^$/msx; + + $NameChar = qr/^ (?: $BaseChar | $Digit | [._:-] | $Extender )+ $/x; + PERL + die $@ if $@; +} +else { + eval <<' PERL'; + + use utf8; # for 5.6 + + $Char = qr/^ [\x09\x0A\x0D\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}] $/x; + + $SingleChar = qr/^$Char$/; + + $BaseChar = qr/ +[\x{0041}-\x{005A}\x{0061}-\x{007A}\x{00C0}-\x{00D6}\x{00D8}-\x{00F6}] | +[\x{00F8}-\x{00FF}\x{0100}-\x{0131}\x{0134}-\x{013E}\x{0141}-\x{0148}] | +[\x{014A}-\x{017E}\x{0180}-\x{01C3}\x{01CD}-\x{01F0}\x{01F4}-\x{01F5}] | +[\x{01FA}-\x{0217}\x{0250}-\x{02A8}\x{02BB}-\x{02C1}\x{0386}\x{0388}-\x{038A}] | +[\x{038C}\x{038E}-\x{03A1}\x{03A3}-\x{03CE}\x{03D0}-\x{03D6}\x{03DA}] | +[\x{03DC}\x{03DE}\x{03E0}\x{03E2}-\x{03F3}\x{0401}-\x{040C}\x{040E}-\x{044F}] | +[\x{0451}-\x{045C}\x{045E}-\x{0481}\x{0490}-\x{04C4}\x{04C7}-\x{04C8}] | +[\x{04CB}-\x{04CC}\x{04D0}-\x{04EB}\x{04EE}-\x{04F5}\x{04F8}-\x{04F9}] | +[\x{0531}-\x{0556}\x{0559}\x{0561}-\x{0586}\x{05D0}-\x{05EA}\x{05F0}-\x{05F2}] | +[\x{0621}-\x{063A}\x{0641}-\x{064A}\x{0671}-\x{06B7}\x{06BA}-\x{06BE}] | +[\x{06C0}-\x{06CE}\x{06D0}-\x{06D3}\x{06D5}\x{06E5}-\x{06E6}\x{0905}-\x{0939}] | +[\x{093D}\x{0958}-\x{0961}\x{0985}-\x{098C}\x{098F}-\x{0990}] | +[\x{0993}-\x{09A8}\x{09AA}-\x{09B0}\x{09B2}\x{09B6}-\x{09B9}\x{09DC}-\x{09DD}] | +[\x{09DF}-\x{09E1}\x{09F0}-\x{09F1}\x{0A05}-\x{0A0A}\x{0A0F}-\x{0A10}] | +[\x{0A13}-\x{0A28}\x{0A2A}-\x{0A30}\x{0A32}-\x{0A33}\x{0A35}-\x{0A36}] | 
+[\x{0A38}-\x{0A39}\x{0A59}-\x{0A5C}\x{0A5E}\x{0A72}-\x{0A74}\x{0A85}-\x{0A8B}] | +[\x{0A8D}\x{0A8F}-\x{0A91}\x{0A93}-\x{0AA8}\x{0AAA}-\x{0AB0}] | +[\x{0AB2}-\x{0AB3}\x{0AB5}-\x{0AB9}\x{0ABD}\x{0AE0}\x{0B05}-\x{0B0C}] | +[\x{0B0F}-\x{0B10}\x{0B13}-\x{0B28}\x{0B2A}-\x{0B30}\x{0B32}-\x{0B33}] | +[\x{0B36}-\x{0B39}\x{0B3D}\x{0B5C}-\x{0B5D}\x{0B5F}-\x{0B61}\x{0B85}-\x{0B8A}] | +[\x{0B8E}-\x{0B90}\x{0B92}-\x{0B95}\x{0B99}-\x{0B9A}\x{0B9C}] | +[\x{0B9E}-\x{0B9F}\x{0BA3}-\x{0BA4}\x{0BA8}-\x{0BAA}\x{0BAE}-\x{0BB5}] | +[\x{0BB7}-\x{0BB9}\x{0C05}-\x{0C0C}\x{0C0E}-\x{0C10}\x{0C12}-\x{0C28}] | +[\x{0C2A}-\x{0C33}\x{0C35}-\x{0C39}\x{0C60}-\x{0C61}\x{0C85}-\x{0C8C}] | +[\x{0C8E}-\x{0C90}\x{0C92}-\x{0CA8}\x{0CAA}-\x{0CB3}\x{0CB5}-\x{0CB9}\x{0CDE}] | +[\x{0CE0}-\x{0CE1}\x{0D05}-\x{0D0C}\x{0D0E}-\x{0D10}\x{0D12}-\x{0D28}] | +[\x{0D2A}-\x{0D39}\x{0D60}-\x{0D61}\x{0E01}-\x{0E2E}\x{0E30}\x{0E32}-\x{0E33}] | +[\x{0E40}-\x{0E45}\x{0E81}-\x{0E82}\x{0E84}\x{0E87}-\x{0E88}\x{0E8A}] | +[\x{0E8D}\x{0E94}-\x{0E97}\x{0E99}-\x{0E9F}\x{0EA1}-\x{0EA3}\x{0EA5}\x{0EA7}] | +[\x{0EAA}-\x{0EAB}\x{0EAD}-\x{0EAE}\x{0EB0}\x{0EB2}-\x{0EB3}\x{0EBD}] | +[\x{0EC0}-\x{0EC4}\x{0F40}-\x{0F47}\x{0F49}-\x{0F69}\x{10A0}-\x{10C5}] | +[\x{10D0}-\x{10F6}\x{1100}\x{1102}-\x{1103}\x{1105}-\x{1107}\x{1109}] | +[\x{110B}-\x{110C}\x{110E}-\x{1112}\x{113C}\x{113E}\x{1140}\x{114C}\x{114E}] | +[\x{1150}\x{1154}-\x{1155}\x{1159}\x{115F}-\x{1161}\x{1163}\x{1165}] | +[\x{1167}\x{1169}\x{116D}-\x{116E}\x{1172}-\x{1173}\x{1175}\x{119E}\x{11A8}] | +[\x{11AB}\x{11AE}-\x{11AF}\x{11B7}-\x{11B8}\x{11BA}\x{11BC}-\x{11C2}] | +[\x{11EB}\x{11F0}\x{11F9}\x{1E00}-\x{1E9B}\x{1EA0}-\x{1EF9}\x{1F00}-\x{1F15}] | +[\x{1F18}-\x{1F1D}\x{1F20}-\x{1F45}\x{1F48}-\x{1F4D}\x{1F50}-\x{1F57}] | +[\x{1F59}\x{1F5B}\x{1F5D}\x{1F5F}-\x{1F7D}\x{1F80}-\x{1FB4}\x{1FB6}-\x{1FBC}] | +[\x{1FBE}\x{1FC2}-\x{1FC4}\x{1FC6}-\x{1FCC}\x{1FD0}-\x{1FD3}] | +[\x{1FD6}-\x{1FDB}\x{1FE0}-\x{1FEC}\x{1FF2}-\x{1FF4}\x{1FF6}-\x{1FFC}] | 
+[\x{2126}\x{212A}-\x{212B}\x{212E}\x{2180}-\x{2182}\x{3041}-\x{3094}] | +[\x{30A1}-\x{30FA}\x{3105}-\x{312C}\x{AC00}-\x{D7A3}] + /x; + + $Extender = qr/ +[\x{00B7}\x{02D0}\x{02D1}\x{0387}\x{0640}\x{0E46}\x{0EC6}\x{3005}\x{3031}-\x{3035}\x{309D}-\x{309E}\x{30FC}-\x{30FE}] +/x; + + $Digit = qr/ +[\x{0030}-\x{0039}\x{0660}-\x{0669}\x{06F0}-\x{06F9}\x{0966}-\x{096F}] | +[\x{09E6}-\x{09EF}\x{0A66}-\x{0A6F}\x{0AE6}-\x{0AEF}\x{0B66}-\x{0B6F}] | +[\x{0BE7}-\x{0BEF}\x{0C66}-\x{0C6F}\x{0CE6}-\x{0CEF}\x{0D66}-\x{0D6F}] | +[\x{0E50}-\x{0E59}\x{0ED0}-\x{0ED9}\x{0F20}-\x{0F29}] +/x; + + $CombiningChar = qr/ +[\x{0300}-\x{0345}\x{0360}-\x{0361}\x{0483}-\x{0486}\x{0591}-\x{05A1}] | +[\x{05A3}-\x{05B9}\x{05BB}-\x{05BD}\x{05BF}\x{05C1}-\x{05C2}\x{05C4}] | +[\x{064B}-\x{0652}\x{0670}\x{06D6}-\x{06DC}\x{06DD}-\x{06DF}\x{06E0}-\x{06E4}] | +[\x{06E7}-\x{06E8}\x{06EA}-\x{06ED}\x{0901}-\x{0903}\x{093C}] | +[\x{093E}-\x{094C}\x{094D}\x{0951}-\x{0954}\x{0962}-\x{0963}\x{0981}-\x{0983}] | +[\x{09BC}\x{09BE}\x{09BF}\x{09C0}-\x{09C4}\x{09C7}-\x{09C8}] | +[\x{09CB}-\x{09CD}\x{09D7}\x{09E2}-\x{09E3}\x{0A02}\x{0A3C}\x{0A3E}\x{0A3F}] | +[\x{0A40}-\x{0A42}\x{0A47}-\x{0A48}\x{0A4B}-\x{0A4D}\x{0A70}-\x{0A71}] | +[\x{0A81}-\x{0A83}\x{0ABC}\x{0ABE}-\x{0AC5}\x{0AC7}-\x{0AC9}\x{0ACB}-\x{0ACD}] | +[\x{0B01}-\x{0B03}\x{0B3C}\x{0B3E}-\x{0B43}\x{0B47}-\x{0B48}] | +[\x{0B4B}-\x{0B4D}\x{0B56}-\x{0B57}\x{0B82}-\x{0B83}\x{0BBE}-\x{0BC2}] | +[\x{0BC6}-\x{0BC8}\x{0BCA}-\x{0BCD}\x{0BD7}\x{0C01}-\x{0C03}\x{0C3E}-\x{0C44}] | +[\x{0C46}-\x{0C48}\x{0C4A}-\x{0C4D}\x{0C55}-\x{0C56}\x{0C82}-\x{0C83}] | +[\x{0CBE}-\x{0CC4}\x{0CC6}-\x{0CC8}\x{0CCA}-\x{0CCD}\x{0CD5}-\x{0CD6}] | +[\x{0D02}-\x{0D03}\x{0D3E}-\x{0D43}\x{0D46}-\x{0D48}\x{0D4A}-\x{0D4D}\x{0D57}] | +[\x{0E31}\x{0E34}-\x{0E3A}\x{0E47}-\x{0E4E}\x{0EB1}\x{0EB4}-\x{0EB9}] | +[\x{0EBB}-\x{0EBC}\x{0EC8}-\x{0ECD}\x{0F18}-\x{0F19}\x{0F35}\x{0F37}\x{0F39}] | +[\x{0F3E}\x{0F3F}\x{0F71}-\x{0F84}\x{0F86}-\x{0F8B}\x{0F90}-\x{0F95}] | 
+[\x{0F97}\x{0F99}-\x{0FAD}\x{0FB1}-\x{0FB7}\x{0FB9}\x{20D0}-\x{20DC}\x{20E1}] | +[\x{302A}-\x{302F}\x{3099}\x{309A}] +/x; + + $Ideographic = qr/ +[\x{4E00}-\x{9FA5}\x{3007}\x{3021}-\x{3029}] +/x; + + $NameChar = qr/^ (?: $BaseChar | $Ideographic | $Digit | [._:-] | $CombiningChar | $Extender )+ $/x; + PERL + + die $@ if $@; +} + +} + +1; diff --git a/lib/XML/SAX/PurePerl/Reader.pm b/lib/XML/SAX/PurePerl/Reader.pm new file mode 100644 index 0000000..cbd7283 --- /dev/null +++ b/lib/XML/SAX/PurePerl/Reader.pm @@ -0,0 +1,136 @@ +# $Id$ + +package XML::SAX::PurePerl::Reader; + +use strict; +use XML::SAX::PurePerl::Reader::URI; +use Exporter (); + +use vars qw(@ISA @EXPORT_OK); +@ISA = qw(Exporter); +@EXPORT_OK = qw( + EOF + BUFFER + LINE + COLUMN + ENCODING + XML_VERSION +); + +use constant EOF => 0; +use constant BUFFER => 1; +use constant LINE => 2; +use constant COLUMN => 3; +use constant ENCODING => 4; +use constant SYSTEM_ID => 5; +use constant PUBLIC_ID => 6; +use constant XML_VERSION => 7; + +require XML::SAX::PurePerl::Reader::Stream; +require XML::SAX::PurePerl::Reader::String; + +if ($] >= 5.007002) { + require XML::SAX::PurePerl::Reader::UnicodeExt; +} +else { + require XML::SAX::PurePerl::Reader::NoUnicodeExt; +} + +sub new { + my $class = shift; + my $thing = shift; + + # try to figure if this $thing is a handle of some sort + if (ref($thing) && UNIVERSAL::isa($thing, 'IO::Handle')) { + return XML::SAX::PurePerl::Reader::Stream->new($thing)->init; + } + my $ioref; + if (tied($thing)) { + my $class = ref($thing); + no strict 'refs'; + $ioref = $thing if defined &{"${class}::TIEHANDLE"}; + } + else { + eval { + $ioref = *{$thing}{IO}; + }; + undef $@; + } + if ($ioref) { + return XML::SAX::PurePerl::Reader::Stream->new($thing)->init; + } + + if ($thing =~ /</) { + return XML::SAX::PurePerl::Reader::String->new($thing)->init; + } + + # assume it is a uri + return XML::SAX::PurePerl::Reader::URI->new($thing)->init; +} + +sub init { + my $self = shift; + $self->[LINE] = 1; + $self->[COLUMN] = 1; + 
$self->read_more; + return $self; +} + +sub data { + my ($self, $min_length) = (@_, 1); + if (length($self->[BUFFER]) < $min_length) { + $self->read_more; + } + return $self->[BUFFER]; +} + +sub match { + my ($self, $char) = @_; + my $data = $self->data; + if (substr($data, 0, 1) eq $char) { + $self->move_along(1); + return 1; + } + return 0; +} + +sub public_id { + my $self = shift; + @_ and $self->[PUBLIC_ID] = shift; + $self->[PUBLIC_ID]; +} + +sub system_id { + my $self = shift; + @_ and $self->[SYSTEM_ID] = shift; + $self->[SYSTEM_ID]; +} + +sub line { + shift->[LINE]; +} + +sub column { + shift->[COLUMN]; +} + +sub get_encoding { + my $self = shift; + return $self->[ENCODING]; +} + +sub get_xml_version { + my $self = shift; + return $self->[XML_VERSION]; +} + +1; + +__END__ + +=head1 NAME + +XML::Parser::PurePerl::Reader - Abstract Reader factory class + +=cut diff --git a/lib/XML/SAX/PurePerl/Reader/NoUnicodeExt.pm b/lib/XML/SAX/PurePerl/Reader/NoUnicodeExt.pm new file mode 100644 index 0000000..1d27416 --- /dev/null +++ b/lib/XML/SAX/PurePerl/Reader/NoUnicodeExt.pm @@ -0,0 +1,25 @@ +# $Id$ + +package XML::SAX::PurePerl::Reader; +use strict; + +sub set_raw_stream { + # no-op +} + +sub switch_encoding_stream { + my ($fh, $encoding) = @_; + throw XML::SAX::Exception::Parse ( + Message => "Only ASCII encoding allowed without perl 5.7.2 or higher. You tried: $encoding", + ) if $encoding !~ /(ASCII|UTF\-?8)/i; +} + +sub switch_encoding_string { + my (undef, $encoding) = @_; + throw XML::SAX::Exception::Parse ( + Message => "Only ASCII encoding allowed without perl 5.7.2 or higher. 
You tried: $encoding", + ) if $encoding !~ /(ASCII|UTF\-?8)/i; +} + +1; + diff --git a/lib/XML/SAX/PurePerl/Reader/Stream.pm b/lib/XML/SAX/PurePerl/Reader/Stream.pm new file mode 100644 index 0000000..b2b0456 --- /dev/null +++ b/lib/XML/SAX/PurePerl/Reader/Stream.pm @@ -0,0 +1,84 @@ +# $Id$ + +package XML::SAX::PurePerl::Reader::Stream; + +use strict; +use vars qw(@ISA); + +use XML::SAX::PurePerl::Reader qw( + EOF + BUFFER + LINE + COLUMN + ENCODING + XML_VERSION +); +use XML::SAX::Exception; + +@ISA = ('XML::SAX::PurePerl::Reader'); + +# subclassed by adding 1 to last element +use constant FH => 8; +use constant BUFFER_SIZE => 4096; + +sub new { + my $class = shift; + my $ioref = shift; + XML::SAX::PurePerl::Reader::set_raw_stream($ioref); + my @parts; + @parts[FH, LINE, COLUMN, BUFFER, EOF, XML_VERSION] = + ($ioref, 1, 0, '', 0, '1.0'); + return bless \@parts, $class; +} + +sub read_more { + my $self = shift; + my $buf; + my $bytesread = read($self->[FH], $buf, BUFFER_SIZE); + if ($bytesread) { + $self->[BUFFER] .= $buf; + return 1; + } + elsif (defined($bytesread)) { + $self->[EOF]++; + return 0; + } + else { + throw XML::SAX::Exception::Parse( + Message => "Error reading from filehandle: $!", + ); + } +} + +sub move_along { + my $self = shift; + my $discarded = substr($self->[BUFFER], 0, $_[0], ''); + + # Wish I could skip this lot - tells us where we are in the file + my $lines = $discarded =~ tr/\n//; + $self->[LINE] += $lines; + if ($lines) { + $discarded =~ /\n([^\n]*)$/; + $self->[COLUMN] = length($1); + } + else { + $self->[COLUMN] += $_[0]; + } +} + +sub set_encoding { + my $self = shift; + my ($encoding) = @_; + # warn("set encoding to: $encoding\n"); + XML::SAX::PurePerl::Reader::switch_encoding_stream($self->[FH], $encoding); + XML::SAX::PurePerl::Reader::switch_encoding_string($self->[BUFFER], $encoding); + $self->[ENCODING] = $encoding; +} + +sub bytepos { + my $self = shift; + tell($self->[FH]); +} + +1; + diff --git 
a/lib/XML/SAX/PurePerl/Reader/String.pm b/lib/XML/SAX/PurePerl/Reader/String.pm new file mode 100644 index 0000000..bd8e6e7 --- /dev/null +++ b/lib/XML/SAX/PurePerl/Reader/String.pm @@ -0,0 +1,78 @@ +# $Id$ + +package XML::SAX::PurePerl::Reader::String; + +use strict; +use vars qw(@ISA); + +use XML::SAX::PurePerl::Reader qw( + LINE + COLUMN + BUFFER + ENCODING + EOF +); + +@ISA = ('XML::SAX::PurePerl::Reader'); + +use constant DISCARDED => 8; +use constant STRING => 9; +use constant USED => 10; +use constant CHUNK_SIZE => 2048; + +sub new { + my $class = shift; + my $string = shift; + my @parts; + @parts[BUFFER, EOF, LINE, COLUMN, DISCARDED, STRING, USED] = + ('', 0, 1, 0, 0, $string, 0); + return bless \@parts, $class; +} + +sub read_more () { + my $self = shift; + if ($self->[USED] >= length($self->[STRING])) { + $self->[EOF]++; + return 0; + } + my $bytes = CHUNK_SIZE; + if ($bytes > (length($self->[STRING]) - $self->[USED])) { + $bytes = (length($self->[STRING]) - $self->[USED]); + } + $self->[BUFFER] .= substr($self->[STRING], $self->[USED], $bytes); + $self->[USED] += $bytes; + return 1; + } + + +sub move_along { + my($self, $bytes) = @_; + my $discarded = substr($self->[BUFFER], 0, $bytes, ''); + $self->[DISCARDED] += length($discarded); + + # Wish I could skip this lot - tells us where we are in the file + my $lines = $discarded =~ tr/\n//; + $self->[LINE] += $lines; + if ($lines) { + $discarded =~ /\n([^\n]*)$/; + $self->[COLUMN] = length($1); + } + else { + $self->[COLUMN] += $_[0]; + } +} + +sub set_encoding { + my $self = shift; + my ($encoding) = @_; + + XML::SAX::PurePerl::Reader::switch_encoding_string($self->[BUFFER], $encoding, "utf-8"); + $self->[ENCODING] = $encoding; +} + +sub bytepos { + my $self = shift; + $self->[DISCARDED]; +} + +1; diff --git a/lib/XML/SAX/PurePerl/Reader/URI.pm b/lib/XML/SAX/PurePerl/Reader/URI.pm new file mode 100644 index 0000000..88e449a --- /dev/null +++ b/lib/XML/SAX/PurePerl/Reader/URI.pm @@ -0,0 +1,57 @@ +# $Id$ + 
+package XML::SAX::PurePerl::Reader::URI; + +use strict; + +use XML::SAX::PurePerl::Reader; +use File::Temp qw(tempfile); +use Symbol; + +## NOTE: This is *not* a subclass of Reader. It just returns Stream or String +## Reader objects depending on what it's capabilities are. + +sub new { + my $class = shift; + my $uri = shift; + # request the URI + if (-e $uri && -f _) { + my $fh = gensym; + open($fh, $uri) || die "Cannot open file $uri : $!"; + return XML::SAX::PurePerl::Reader::Stream->new($fh); + } + elsif ($uri =~ /^file:(.*)$/ && -e $1 && -f _) { + my $file = $1; + my $fh = gensym; + open($fh, $file) || die "Cannot open file $file : $!"; + return XML::SAX::PurePerl::Reader::Stream->new($fh); + } + else { + # request URI, return String reader + require LWP::UserAgent; + my $ua = LWP::UserAgent->new; + $ua->agent("Perl/XML/SAX/PurePerl/1.0 " . $ua->agent); + + my $req = HTTP::Request->new(GET => $uri); + + my $fh = tempfile(); + + my $callback = sub { + my ($data, $response, $protocol) = @_; + print $fh $data; + }; + + my $res = $ua->request($req, $callback, 4096); + + if ($res->is_success) { + seek($fh, 0, 0); + return XML::SAX::PurePerl::Reader::Stream->new($fh); + } + else { + die "LWP Request Failed"; + } + } +} + + +1; diff --git a/lib/XML/SAX/PurePerl/Reader/UnicodeExt.pm b/lib/XML/SAX/PurePerl/Reader/UnicodeExt.pm new file mode 100644 index 0000000..38f90f7 --- /dev/null +++ b/lib/XML/SAX/PurePerl/Reader/UnicodeExt.pm @@ -0,0 +1,23 @@ +# $Id$ + +package XML::SAX::PurePerl::Reader; +use strict; + +use Encode (); + +sub set_raw_stream { + my ($fh) = @_; + binmode($fh, ":bytes"); +} + +sub switch_encoding_stream { + my ($fh, $encoding) = @_; + binmode($fh, ":encoding($encoding)"); +} + +sub switch_encoding_string { + $_[0] = Encode::decode($_[1], $_[0]); +} + +1; + diff --git a/lib/XML/SAX/PurePerl/UnicodeExt.pm b/lib/XML/SAX/PurePerl/UnicodeExt.pm new file mode 100644 index 0000000..edbd6fc --- /dev/null +++ b/lib/XML/SAX/PurePerl/UnicodeExt.pm @@ -0,0 
+1,22 @@ +# $Id$ + +package XML::SAX::PurePerl; +use strict; + +no warnings 'utf8'; + +sub chr_ref { + return chr(shift); +} + +if ($] >= 5.007002) { + require Encode; + + Encode::define_alias( "UTF-16" => "UCS-2" ); + Encode::define_alias( "UTF-16BE" => "UCS-2" ); + Encode::define_alias( "UTF-16LE" => "ucs-2le" ); + Encode::define_alias( "UTF16LE" => "ucs-2le" ); +} + +1; + diff --git a/lib/XML/SAX/PurePerl/XMLDecl.pm b/lib/XML/SAX/PurePerl/XMLDecl.pm new file mode 100644 index 0000000..0fba016 --- /dev/null +++ b/lib/XML/SAX/PurePerl/XMLDecl.pm @@ -0,0 +1,129 @@ +# $Id$ + +package XML::SAX::PurePerl; + +use strict; +use XML::SAX::PurePerl::Productions qw($S $VersionNum $EncNameStart $EncNameEnd); + +sub XMLDecl { + my ($self, $reader) = @_; + + my $data = $reader->data(5); + # warn("Looking for xmldecl in: $data"); + if ($data =~ /^<\?xml$S/o) { + $reader->move_along(5); + $self->skip_whitespace($reader); + + # get version attribute + $self->VersionInfo($reader) || + $self->parser_error("XML Declaration lacks required version attribute, or version attribute does not match XML specification", $reader); + + if (!$self->skip_whitespace($reader)) { + my $data = $reader->data(2); + $data =~ /^\?>/ or $self->parser_error("Syntax error", $reader); + $reader->move_along(2); + return; + } + + if ($self->EncodingDecl($reader)) { + if (!$self->skip_whitespace($reader)) { + my $data = $reader->data(2); + $data =~ /^\?>/ or $self->parser_error("Syntax error", $reader); + $reader->move_along(2); + return; + } + } + + $self->SDDecl($reader); + + $self->skip_whitespace($reader); + + my $data = $reader->data(2); + $data =~ /^\?>/ or $self->parser_error("Syntax error", $reader); + $reader->move_along(2); + } + else { + # warn("first 5 bytes: ", join(',', unpack("CCCCC", $data)), "\n"); + # no xml decl + if (!$reader->get_encoding) { + $reader->set_encoding("UTF-8"); + } + } +} + +sub VersionInfo { + my ($self, $reader) = @_; + + my $data = $reader->data(11); + + # warn("Looking 
for version in $data"); + + $data =~ /^(version$S*=$S*(["'])($VersionNum)\2)/o or return 0; + $reader->move_along(length($1)); + my $vernum = $3; + + if ($vernum ne "1.0") { + $self->parser_error("Only XML version 1.0 supported. Saw: '$vernum'", $reader); + } + + return 1; +} + +sub SDDecl { + my ($self, $reader) = @_; + + my $data = $reader->data(15); + + $data =~ /^(standalone$S*=$S*(["'])(yes|no)\2)/o or return 0; + $reader->move_along(length($1)); + my $yesno = $3; + + if ($yesno eq 'yes') { + $self->{standalone} = 1; + } + else { + $self->{standalone} = 0; + } + + return 1; +} + +sub EncodingDecl { + my ($self, $reader) = @_; + + my $data = $reader->data(12); + + $data =~ /^(encoding$S*=$S*(["'])($EncNameStart$EncNameEnd*)\2)/o or return 0; + $reader->move_along(length($1)); + my $encoding = $3; + + $reader->set_encoding($encoding); + + return 1; +} + +sub TextDecl { + my ($self, $reader) = @_; + + my $data = $reader->data(6); + $data =~ /^<\?xml$S+/ or return; + $reader->move_along(5); + $self->skip_whitespace($reader); + + if ($self->VersionInfo($reader)) { + $self->skip_whitespace($reader) || + $self->parser_error("Lack of whitespace after version attribute in text declaration", $reader); + } + + $self->EncodingDecl($reader) || + $self->parser_error("Encoding declaration missing from external entity text declaration", $reader); + + $self->skip_whitespace($reader); + + $data = $reader->data(2); + $data =~ /^\?>/ or $self->parser_error("Syntax error", $reader); + + return 1; +} + +1; diff --git a/t/00basic.t b/t/00basic.t new file mode 100644 index 0000000..810706e --- /dev/null +++ b/t/00basic.t @@ -0,0 +1,11 @@ +use Test; +BEGIN { plan tests => 2 } +END { ok($loaded == 2) } +use XML::SAX; +$loaded++; + +use XML::SAX::PurePerl; +$loaded++; + +ok(XML::SAX->VERSION eq XML::SAX::PurePerl->VERSION); + diff --git a/t/01known.t b/t/01known.t new file mode 100644 index 0000000..a2db4f1 --- /dev/null +++ b/t/01known.t @@ -0,0 +1,6 @@ +use Test; +BEGIN { plan tests 
=> 3 } +use XML::SAX; +ok(XML::SAX->save_parsers); +ok(XML::SAX->load_parsers); +ok(@{XML::SAX->parsers}, 0); diff --git a/t/10xmldecl1.t b/t/10xmldecl1.t new file mode 100644 index 0000000..35d863e --- /dev/null +++ b/t/10xmldecl1.t @@ -0,0 +1,19 @@ +use Test; +BEGIN { plan tests => 5 } +use XML::SAX::PurePerl; +use XML::SAX::PurePerl::DebugHandler; +use IO::File; + +my $handler = XML::SAX::PurePerl::DebugHandler->new(); +ok($handler); + +my $parser = XML::SAX::PurePerl->new(Handler => $handler); +ok($parser); + +my $file = IO::File->new("testfiles/01.xml") || die $!; +ok($file); + +$parser->parse_file($file); +ok($handler->{seen}{start_document}); +ok($handler->{seen}{end_document}); + diff --git a/t/11xmldecl2.t b/t/11xmldecl2.t new file mode 100644 index 0000000..9074dfb --- /dev/null +++ b/t/11xmldecl2.t @@ -0,0 +1,38 @@ +use Test; +BEGIN { plan tests => 10 } +use XML::SAX::PurePerl; +use XML::SAX::PurePerl::DebugHandler; +use IO::File; + +my $handler = XML::SAX::PurePerl::DebugHandler->new(); +ok($handler); + +my $parser = XML::SAX::PurePerl->new(Handler => $handler); +ok($parser); + +my $file = IO::File->new("testfiles/02a.xml"); +ok($file); + +# check invalid characters +eval { +$parser->parse_file($file); +}; +ok($@); + +print $@; + +ok($@->{Message}); +ok($@->{LineNumber}, 1); +ok($@->{ColumnNumber}, 6); + +# check invalid version number +eval { +$parser->parse_uri("file:testfiles/02b.xml"); +}; +ok($@); + +print $@; + +ok($@->{LineNumber}, 1); +ok($@->{ColumnNumber}, 19); + diff --git a/t/12miscstart.t b/t/12miscstart.t new file mode 100644 index 0000000..5325c23 --- /dev/null +++ b/t/12miscstart.t @@ -0,0 +1,25 @@ +use Test; +BEGIN { plan tests => 6 } +use XML::SAX::PurePerl; +use XML::SAX::PurePerl::DebugHandler; + +my $handler = XML::SAX::PurePerl::DebugHandler->new(); +ok($handler); + +my $parser = XML::SAX::PurePerl->new(Handler => $handler, + "http://xml.org/sax/handlers/LexicalHandler" => $handler); +ok($parser); + +# check PIs and comments before 
document element parse ok +$parser->parse_uri("testfiles/03a.xml"); +ok($handler->{seen}{processing_instruction}, 2); +ok($handler->{seen}{comment}); + +# check invalid version number +eval { +$parser->parse_uri("testfiles/03b.xml"); +}; +ok($@); +ok($@->{LineNumber}, 4); + + diff --git a/t/13int_ent.t b/t/13int_ent.t new file mode 100644 index 0000000..816ce96 --- /dev/null +++ b/t/13int_ent.t @@ -0,0 +1,14 @@ +use Test; +BEGIN { plan tests => 3 } +use XML::SAX::PurePerl; +use XML::SAX::PurePerl::DebugHandler; + +my $handler = XML::SAX::PurePerl::DebugHandler->new(); +ok($handler); + +my $parser = XML::SAX::PurePerl->new(Handler => $handler); +ok($parser); + +$parser->parse_uri("testfiles/04a.xml"); +ok(1); + diff --git a/t/14encoding.t b/t/14encoding.t new file mode 100644 index 0000000..18fecb7 --- /dev/null +++ b/t/14encoding.t @@ -0,0 +1,53 @@ +use Test; +BEGIN { $tests = 0; + if ($] >= 5.007002) { $tests = 9 } + plan tests => $tests; +} +if ($tests) { +use XML::SAX::PurePerl; + +my $handler = TestHandler->new(); # see below for the TestHandler class +ok($handler); + +my $parser = XML::SAX::PurePerl->new(Handler => $handler); +ok($parser); + +# warn("utf-16\n"); +# verify that the first element is correctly decoded +$handler->{test_elements} = [ "\x{9031}\x{5831}" ]; +$parser->parse_uri("testfiles/utf-16.xml"); +ok(1); + +# warn("utf-16le\n"); +$handler->{test_elements} = [ "foo" ]; +$parser->parse_uri("testfiles/utf-16le.xml"); +ok(1); + +# warn("koi8_r\n"); +$parser->parse_uri("testfiles/koi8_r.xml"); +ok(1); + +# warn("8859-1\n"); +$parser->parse_uri("testfiles/iso8859_1.xml"); +ok(1); + +# warn("8859-2\n"); +$parser->parse_uri("testfiles/iso8859_2.xml"); +ok(1); +} + +package TestHandler; +use XML::SAX::PurePerl::DebugHandler; +use base qw(XML::SAX::PurePerl::DebugHandler); +use Test; + +sub start_element { + my $self = shift; + if ($self->{test_elements} and + my $value = pop @{$self->{test_elements}}) { + ok($_[0]->{Name}, $value); + } + 
$self->SUPER::start_element(@_); +} + +1; diff --git a/t/15element.t b/t/15element.t new file mode 100644 index 0000000..cf40d9d --- /dev/null +++ b/t/15element.t @@ -0,0 +1,113 @@ +use Test; +BEGIN { plan tests => 32 } +use XML::SAX::PurePerl; +use XML::SAX::PurePerl::DebugHandler; + +my $handler = TestElementHandler->new(); +ok($handler); + +my $parser = XML::SAX::PurePerl->new(Handler => $handler); +ok($parser); + +$handler->{test_el}{Name} = "foo"; +$handler->{test_el}{LocalName} = "foo"; +$handler->{test_el}{Prefix} = ""; +$handler->{test_el}{NamespaceURI} = ""; +$parser->parse_uri("testfiles/06a.xml"); + +$handler->reset; + +$handler->{attr_name} = "{}a"; +$handler->{test_attr}{Name} = "a"; +$handler->{test_attr}{Value} = "1"; +$handler->{test_attr}{Prefix} = ""; +$handler->{test_attr}{LocalName} = "a"; +$handler->{test_attr}{NamespaceURI} = ""; +$parser->parse_uri("testfiles/06b.xml"); + +$handler->reset; + +$handler->{test_el}{Name} = "foo"; +$handler->{test_el}{LocalName} = "foo"; +$handler->{test_el}{Prefix} = ""; +$handler->{test_el}{NamespaceURI} = "http://foo.com"; +$parser->parse_uri("testfiles/06c.xml"); + +$handler->reset; +$handler->{test_el}{Name} = "x:foo"; +$handler->{test_el}{LocalName} = "foo"; +$handler->{test_el}{Prefix} = "x"; +$handler->{test_el}{NamespaceURI} = "http://foo.com"; +$handler->{attr_name} = "{http://foo.com}a"; +$handler->{test_attr}{Name} = "x:a"; +$handler->{test_attr}{Value} = "2"; +$handler->{test_attr}{Prefix} = "x"; +$handler->{test_attr}{LocalName} = "a"; +$handler->{test_attr}{NamespaceURI} = "http://foo.com"; +$parser->parse_uri("testfiles/06d.xml"); + +$handler->reset; +$handler->{test_el}{Name} = "foo"; +$handler->{test_el}{LocalName} = "foo"; +$handler->{test_el}{Prefix} = ""; +$handler->{test_el}{NamespaceURI} = "http://foo.com"; +$handler->{attr_name} = "{}a"; +$handler->{test_attr}{Name} = "a"; +$handler->{test_attr}{NamespaceURI} = ""; +$parser->parse_uri("testfiles/06e.xml"); + +$handler->reset; +# prefix 
with no ns binding. Error. +eval { +$parser->parse_uri("testfiles/06f.xml"); +}; +ok($@); + +$handler->reset; +$handler->{test_chr}{Data} = "bar"; +$parser->parse_uri("testfiles/06g.xml"); + +## HELPER PACKAGE ## + +package TestElementHandler; + +use Test; + +sub new { + my $class = shift; + my %opts = @_; + bless \%opts, $class; +} + +sub reset { + my $self = shift; + %$self = (); +} + +sub start_element { + my ($self, $el) = @_; + if ($self->{test_el}) { + foreach my $key (keys %{ $self->{test_el} }) { + #warn $key, "\n"; + ok("$key: $el->{$key}", "$key: $self->{test_el}{$key}"); + } + } + if ($self->{test_attr}) { + foreach my $key (keys %{ $self->{test_attr} }) { + #warn $key, "\n"; + ok("$key: $el->{Attributes}{ $self->{attr_name} }{$key}", "$key: $self->{test_attr}{$key}"); + } + } +} + +sub characters { + my ($self, @text) = @_; + if ($self->{test_chr}) { + foreach my $text (@text) { + foreach my $key (keys %{ $self->{test_chr} }) { + #warn $key, "\n"; + ok("$key: $text->{$key}", "$key: $self->{test_chr}{$key}"); + } + } + } +} diff --git a/t/16large.t b/t/16large.t new file mode 100644 index 0000000..537dd8d --- /dev/null +++ b/t/16large.t @@ -0,0 +1,16 @@ +use Test; +BEGIN { plan tests => 3 } +use XML::SAX::PurePerl; +use XML::SAX::PurePerl::DebugHandler; + +my $handler = XML::SAX::PurePerl::DebugHandler->new(); +ok($handler); + +my $parser = XML::SAX::PurePerl->new(Handler => $handler); +ok($parser); + +my $time = time; +$parser->parse_uri("testfiles/xmltest.xml"); +warn("parsed ", -s "testfiles/xmltest.xml", " bytes in ", time - $time, " seconds\n"); +ok(1); + diff --git a/t/19pi.t b/t/19pi.t new file mode 100644 index 0000000..622c98c --- /dev/null +++ b/t/19pi.t @@ -0,0 +1,26 @@ +use Test; +BEGIN { plan tests => 2 } +use XML::SAX::PurePerl; +use XML::SAX::PurePerl::DebugHandler; + +my $value = do { local $/ = undef; my $data = <DATA>; }; + +my $parser = XML::SAX::PurePerl->new(Handler => My::SAXFilter->new()); +$parser->parse_string($value); + +package 
My::SAXFilter; + +use base qw(XML::SAX::Base); + +sub processing_instruction { + my $this = shift; + my $data = shift; + + main::ok($data->{Target},"xml-stylesheet"); + main::ok($data->{Data},"type=\"text/xsl\" href=\"processorinxml/base.xsl\""); +} + +__END__ + + + diff --git a/t/20factory.t b/t/20factory.t new file mode 100644 index 0000000..864b346 --- /dev/null +++ b/t/20factory.t @@ -0,0 +1,42 @@ +use Test; +BEGIN { plan tests => 16 } +use XML::SAX::ParserFactory; + +# load SAX parsers (no ParserDetails.ini available at first in blib) +use XML::SAX qw(Namespaces Validation); +ok(@{XML::SAX->parsers}, 0); +ok(XML::SAX->add_parser(q(XML::SAX::PurePerl))); +ok(@{XML::SAX->parsers}, 1); + +ok(XML::SAX::ParserFactory->parser); # test class method +my $factory = XML::SAX::ParserFactory->new(); +ok($factory); +ok($factory->parser); + +ok($factory->require_feature(Namespaces)); +ok($factory->parser); + +ok($factory->require_feature(Validation)); +eval { + my $parser = $factory->parser; + # should never get here unless PurePerl starts providing validation + ok(!$parser); +}; +ok($@); +ok($@->isa('XML::SAX::Exception')); + +$factory = XML::SAX::ParserFactory->new(); +my $parser = $factory->parser; +ok($parser); +eval { + $parser->parse_string(''); + ok(1); +}; +ok(!$@); + +local $XML::SAX::ParserPackage = 'XML::SAX::PurePerl'; +ok(XML::SAX::ParserFactory->parser); + +local $XML::SAX::ParserPackage = 'XML::SAX::PurePerl (0.01)'; +ok(XML::SAX::ParserFactory->parser); + diff --git a/t/21saxini.t b/t/21saxini.t new file mode 100644 index 0000000..047a04c --- /dev/null +++ b/t/21saxini.t @@ -0,0 +1,63 @@ +use strict; +use Test; +plan tests => 8; + +use File::Path; +use Fatal qw(open close chdir mkpath); +use XML::SAX; +use vars qw(*FILE); + +ok(@{XML::SAX->parsers}, 0); +ok(XML::SAX->add_parser(q(XML::SAX::PurePerl))); +ok(@{XML::SAX->parsers}, 1); + +rmtree('t/lib', 0, 0); +mkpath('t/lib', 0, 0777); + +push @INC, 't/lib'; +write_file('t/lib/TestParserPackage.pm', <parser(); 
+ok(ref($parser), "TestParserPackage", "Parser was not TestParserPackage"); + +# Test we can get XML::SAX::PurePerl out +write_file('t/lib/SAX.ini', 'ParserPackage = XML::SAX::PurePerl'); +$parser = XML::SAX::ParserFactory->parser(); +ok(ref($parser), "XML::SAX::PurePerl", "Parser was not XML::SAX::PurePerl"); + +# Test we can ask for a frobnosticating parser, but not get it +write_file('t/lib/SAX.ini', 'http://axkit.org/sax/frobnosticating = 1'); +$parser = XML::SAX::ParserFactory->parser(); +ok(ref($parser), "XML::SAX::PurePerl", "Parser was not PurePerl (frobnosticating)"); + +# Test we can ask for a frobnosticating parser, and get it +XML::SAX->add_parser('TestParserPackage'); +$parser = XML::SAX::ParserFactory->parser(); +ok(ref($parser), "TestParserPackage", "Parser was not TestParserPackage (frobnosticating)"); + +# Test we can get a namespaces parser +write_file('t/lib/SAX.ini', 'http://xml.org/sax/features/namespaces = 1'); +$parser = XML::SAX::ParserFactory->parser(); +ok(ref($parser), "XML::SAX::PurePerl", "Parser was not PurePerl (namespaces)"); + +sub write_file { + my ($file, $data) = @_; + open(FILE, ">$file"); + print FILE $data, "\n"; + close FILE; +} \ No newline at end of file diff --git a/t/30parse_file.t b/t/30parse_file.t new file mode 100644 index 0000000..08a8992 --- /dev/null +++ b/t/30parse_file.t @@ -0,0 +1,29 @@ +use Test; +BEGIN { plan tests => 5 } +use XML::SAX::PurePerl; +use XML::SAX::PurePerl::DebugHandler; +use IO::File; + +my $handler = XML::SAX::PurePerl::DebugHandler->new(); +ok($handler); + +my $parser = XML::SAX::PurePerl->new(Handler => $handler); +ok($parser); + +my $file1 = IO::File->new("testfiles/01.xml"); +ok($file1); + +eval { +$parser->parse_file($file1); +}; +print $@; +ok(!$@); + +my $file2 = "testfiles/01.xml"; + +eval { +$parser->parse_file($file2); +}; +print $@; +ok(!$@); + diff --git a/t/40cdata.t b/t/40cdata.t new file mode 100644 index 0000000..3eac7f2 --- /dev/null +++ b/t/40cdata.t @@ -0,0 +1,31 @@ +use 
strict; +use warnings; + +use Test; +BEGIN { plan tests => 4 } + +use XML::SAX::PurePerl; + +my $handler = CDataHandler->new(); +ok($handler); + +my $parser = XML::SAX::PurePerl->new(Handler => $handler); +ok($parser); + +$parser->parse_string(']]>'); +ok(1); # parser didn't die + +my $expected = ''; +ok($handler->cbuffer, $expected); + +exit; + + +package CDataHandler; + +use base 'XML::SAX::Base'; + +sub start_document { shift->{_buf} = ''; } +sub characters { shift->{_buf} .= shift->{Data}; } +sub cbuffer { shift->{_buf}; } + diff --git a/t/42entities.t b/t/42entities.t new file mode 100644 index 0000000..6b8572b --- /dev/null +++ b/t/42entities.t @@ -0,0 +1,37 @@ +use strict; +use warnings; + +use Test; +BEGIN { plan tests => 4 } + +use XML::SAX::PurePerl; + +my $handler = AttrHandler->new(); +ok($handler); + +my $parser = XML::SAX::PurePerl->new(Handler => $handler); +ok($parser); + +$parser->parse_string(''); +ok(1); # parser didn't die + +my $expected = "amp=& num=A x3E=> "; +ok($handler->attributes, $expected); + +exit; + + +package AttrHandler; + +use base 'XML::SAX::Base'; + +sub start_document { shift->{_buf} = ''; } +sub attributes { shift->{_buf}; } + +sub start_element { + my($self, $data) = @_; + my $attr = $data->{Attributes}; + foreach (sort keys %$attr) { + $self->{_buf} .= "$attr->{$_}->{LocalName}=$attr->{$_}->{Value} "; + } +} diff --git a/t/99cleanup.t b/t/99cleanup.t new file mode 100644 index 0000000..32e196c --- /dev/null +++ b/t/99cleanup.t @@ -0,0 +1,8 @@ +use Test; +BEGIN { plan tests => 1 } +use File::Spec; +ok(unlink( + File::Spec->catdir(qw(blib lib XML SAX ParserDetails.ini))), + 1, + 'delete ParserDetails.ini' +); diff --git a/testfiles/01.xml b/testfiles/01.xml new file mode 100644 index 0000000..f4e5164 --- /dev/null +++ b/testfiles/01.xml @@ -0,0 +1,2 @@ + + diff --git a/testfiles/02a.xml b/testfiles/02a.xml new file mode 100644 index 0000000..5009872 --- /dev/null +++ b/testfiles/02a.xml @@ -0,0 +1 @@ + diff --git 
a/testfiles/02b.xml b/testfiles/02b.xml new file mode 100644 index 0000000..e62ca93 --- /dev/null +++ b/testfiles/02b.xml @@ -0,0 +1,4 @@ + + +Hello World + diff --git a/testfiles/03a.xml b/testfiles/03a.xml new file mode 100644 index 0000000..1d5f861 --- /dev/null +++ b/testfiles/03a.xml @@ -0,0 +1,5 @@ + + + + + diff --git a/testfiles/03b.xml b/testfiles/03b.xml new file mode 100644 index 0000000..0700a9b --- /dev/null +++ b/testfiles/03b.xml @@ -0,0 +1,5 @@ + + + + + diff --git a/testfiles/04a.xml b/testfiles/04a.xml new file mode 100644 index 0000000..e4ea5f9 --- /dev/null +++ b/testfiles/04a.xml @@ -0,0 +1 @@ +text & diff --git a/testfiles/06a.xml b/testfiles/06a.xml new file mode 100644 index 0000000..f1999f8 --- /dev/null +++ b/testfiles/06a.xml @@ -0,0 +1 @@ + diff --git a/testfiles/06b.xml b/testfiles/06b.xml new file mode 100644 index 0000000..bda42f9 --- /dev/null +++ b/testfiles/06b.xml @@ -0,0 +1 @@ + diff --git a/testfiles/06c.xml b/testfiles/06c.xml new file mode 100644 index 0000000..85a1e28 --- /dev/null +++ b/testfiles/06c.xml @@ -0,0 +1 @@ + diff --git a/testfiles/06d.xml b/testfiles/06d.xml new file mode 100644 index 0000000..2d4d24a --- /dev/null +++ b/testfiles/06d.xml @@ -0,0 +1 @@ + diff --git a/testfiles/06e.xml b/testfiles/06e.xml new file mode 100644 index 0000000..8943477 --- /dev/null +++ b/testfiles/06e.xml @@ -0,0 +1 @@ + diff --git a/testfiles/06f.xml b/testfiles/06f.xml new file mode 100644 index 0000000..5592dbf --- /dev/null +++ b/testfiles/06f.xml @@ -0,0 +1,2 @@ + + diff --git a/testfiles/06g.xml b/testfiles/06g.xml new file mode 100644 index 0000000..93a15b0 --- /dev/null +++ b/testfiles/06g.xml @@ -0,0 +1 @@ + diff --git a/testfiles/iso8859_1.xml b/testfiles/iso8859_1.xml new file mode 100644 index 0000000..87699dc --- /dev/null +++ b/testfiles/iso8859_1.xml @@ -0,0 +1,2 @@ + +Some Latin-1 chars: éÊË diff --git a/testfiles/iso8859_2.xml b/testfiles/iso8859_2.xml new file mode 100644 index 0000000..bf8bd75 --- /dev/null +++ 
b/testfiles/iso8859_2.xml @@ -0,0 +1,2 @@ + +Some Latin-2 chars: ñýÎ (could do with some real latin-2 here) diff --git a/testfiles/koi8_r.xml b/testfiles/koi8_r.xml new file mode 100644 index 0000000..e4e60b0 --- /dev/null +++ b/testfiles/koi8_r.xml @@ -0,0 +1,9 @@ + + +<ÄÏËÕÍÅÎÔ ÁÔÔ="ÒÕÓÓËÉÊ"> + <ÒÁÚÄÅÌ> + <ÇÏÌÏ×Á>ôÅÓÔ + úÄÒÁ×ÓÔ×ÕÊ íÉÒ! + + + diff --git a/testfiles/utf-16.xml b/testfiles/utf-16.xml new file mode 100644 index 0000000..6c8622a Binary files /dev/null and b/testfiles/utf-16.xml differ diff --git a/testfiles/utf-16le.xml b/testfiles/utf-16le.xml new file mode 100644 index 0000000..16003ba Binary files /dev/null and b/testfiles/utf-16le.xml differ diff --git a/testfiles/xmltest.xml b/testfiles/xmltest.xml new file mode 100644 index 0000000..4facc3a --- /dev/null +++ b/testfiles/xmltest.xml @@ -0,0 +1,1443 @@ + + + + + + + + Attribute values must start with attribute names, not "?". + + Names may not start with "."; it's not a Letter. + + Processing Instruction target name is required. + + SGML-ism: processing instructions end in '?>' not '>'. + + Processing instructions end in '?>' not '?'. + + XML comments may not contain "--" + + General entity references have no whitespace after the + entity name and before the semicolon. + + Entity references must include names, which don't begin + with '.' (it's not a Letter or other name start character). + + Character references may have only decimal or numeric strings. + + Ampersand may only appear as part of a general entity reference. + + SGML-ism: attribute values must be explicitly assigned a + value, it can't act as a boolean toggle. + + SGML-ism: attribute values must be quoted in all cases. + + The quotes on both ends of an attribute value must match. + + Attribute values may not contain literal '<' characters. + + Attribute values need a value, not just an equals sign. + + Attribute values need an associated name. + + CDATA sections need a terminating ']]>'. 
+ + CDATA sections begin with a literal '<![CDATA[', no space. + + End tags may not be abbreviated as '</>'. + + Attribute values may not contain literal '&' + characters except as part of an entity reference. + + Attribute values may not contain literal '&' + characters except as part of an entity reference. + + Character references end with semicolons, always! + + Digits are not valid name start characters. + + Digits are not valid name start characters. + + Text may not contain a literal ']]>' sequence. + + Text may not contain a literal ']]>' sequence. + + Comments must be terminated with "-->". + + Processing instructions must end with '?>'. + + Text may not contain a literal ']]>' sequence. + + A form feed is not a legal XML character. + + A form feed is not a legal XML character. + + A form feed is not a legal XML character. + + An ESC (octal 033) is not a legal XML character. + + A form feed is not a legal XML character. + + The '<' character is a markup delimiter and must + start an element, CDATA section, PI, or comment. + + Text may not appear after the root element. + + Character references may not appear after the root element. + + Tests the "Unique Att Spec" WF constraint by providing + multiple values for an attribute. + + Tests the Element Type Match WFC - end tag name must + match start tag name. + + Provides two document elements. + + Provides two document elements. + + Invalid End Tag + + Provides #PCDATA text after the document element. + + Provides two document elements. + + Invalid Empty Element Tag + + This start (or empty element) tag was not terminated correctly. + + Invalid empty element tag invalid whitespace + + Provides a CDATA section after the roor element. + + Missing start tag + + Empty document, with no root element. + + CDATA is invalid at top level of document. + + Invalid character reference. + + End tag does not match start tag. + + PUBLIC requires two literals. + + Invalid Document Type Definition format. 
+ + Invalid Document Type Definition format - misplaced comment. + + This isn't SGML; comments can't exist in declarations. + + Invalid character , in ATTLIST enumeration + + String literal must be in quotes. + + Invalid type NAME defined in ATTLIST. + + External entity declarations require whitespace between public + and system IDs. + + Entity declarations need space after the entity name. + + Conditional sections may only appear in the external + DTD subset. + + Space is required between attribute type and default values + in <!ATTLIST...> declarations. + + Space is required between attribute name and type + in <!ATTLIST...> declarations. + + Required whitespace is missing. + + Space is required between attribute type and default values + in <!ATTLIST...> declarations. + + Space is required between NOTATION keyword and list of + enumerated choices in <!ATTLIST...> declarations. + + Space is required before an NDATA entity annotation. + + XML comments may not contain "--" + + ENTITY can't reference itself directly or indirectly. + + Undefined ENTITY foo. + + Undefined ENTITY f. + + Internal general parsed entities are only well formed if + they match the "content" production. + + ENTITY can't reference itself directly or indirectly. + + Undefined ENTITY foo. + + Undefined ENTITY bar. + + Undefined ENTITY foo. + + ENTITY can't reference itself directly or indirectly. + + ENTITY can't reference itself directly or indirectly. + + This tests the No External Entity References WFC, + since the entity is referred to within an attribute. + + This tests the No External Entity References WFC, + since the entity is referred to within an attribute. + + Undefined NOTATION n. + + Tests the Parsed Entity WFC by referring to an + unparsed entity. (This precedes the error of not declaring + that entity's notation, which may be detected any time before + the DTD parsing is completed.) + + Public IDs may not contain "[". + + Public IDs may not contain "[". 
+ + Public IDs may not contain "[". + + Attribute values are terminated by literal quote characters, + and any entity expansion is done afterwards. + + Parameter entities "are" always parsed; NDATA annotations + are not permitted. + + Attributes may not contain a literal "<" character; + this one has one because of reference expansion. + + Parameter entities "are" always parsed; NDATA annotations + are not permitted. + + The replacement text of this entity has an illegal reference, + because the character reference is expanded immediately. + + Hexadecimal character references may not use the uppercase 'X'. + + Prolog VERSION must be lowercase. + + VersionInfo must come before EncodingDecl. + + Space is required before the standalone declaration. + + Both quotes surrounding VersionNum must be the same. + + Only one "version=..." string may appear in an XML declaration. + + Only three pseudo-attributes are in the XML declaration, + and "valid=..." is not one of them. + + Only "yes" and "no" are permitted as values of "standalone". + + Space is not permitted in an encoding name. + + Provides an illegal XML version number; spaces are illegal. + + End-tag required for element foo. + + Internal general parsed entities are only well formed if + they match the "content" production. + + Invalid placement of CDATA section. + + Invalid placement of entity declaration. + + Invalid document type declaration. CDATA alone is invalid. + + No space in '<![CDATA['. + + Tags invalid within EntityDecl. + + Entity reference must be in content of element. + + Entiry reference must be in content of element not Start-tag. + + CDATA sections start '<![CDATA[', not '<!cdata['. + + Parameter entity values must use valid reference syntax; + this reference is malformed. + + General entity values must use valid reference syntax; + this reference is malformed. 
+ + The replacement text of this entity is an illegal character + reference, which must be rejected when it is parsed in the + context of an attribute value. + + Internal general parsed entities are only well formed if + they match the "content" production. This is a partial + character reference, not a full one. + + Internal general parsed entities are only well formed if + they match the "content" production. This is a partial + character reference, not a full one. + + Entity reference expansion is not recursive. + + Internal general parsed entities are only well formed if + they match the "content" production. This is a partial + character reference, not a full one. + + Character references are expanded in the replacement text of + an internal entity, which is then parsed as usual. Accordingly, + & must be doubly quoted - encoded either as &amp; + or as &#38;#38;. + + A name of an ENTITY was started with an invalid character. + + Invalid syntax mixed connectors are used. + + Invalid syntax mismatched parenthesis. + + Invalid format of Mixed-content declaration. + + Invalid syntax extra set of parenthesis not necessary. + + Invalid syntax Mixed-content must be defined as zero or more. + + Invalid syntax Mixed-content must be defined as zero or more. + + Invalid CDATA syntax. + + Invalid syntax for Element Type Declaration. + + Invalid syntax for Element Type Declaration. + + Invalid syntax for Element Type Declaration. + + Invalid syntax mixed connectors used. + + Illegal whitespace before optional character causes syntax error. + + Illegal whitespace before optional character causes syntax error. + + Invalid character used as connector. + + Tag omission is invalid in XML. + + Space is required before a content model. + + Invalid syntax for content particle. + + The element-content model should not be empty. + + Character '&#x309a;' is a CombiningChar, not a + Letter, and so may not begin a name. + + Character #x0E5C is not legal in XML names. 
+ + Character #x0000 is not legal anywhere in an XML document. + + Character #x001F is not legal anywhere in an XML document. + + Character #xFFFF is not legal anywhere in an XML document. + + Character #xD800 is not legal anywhere in an XML document. (If it + appeared in a UTF-16 surrogate pair, it'd represent half of a UCS-4 + character and so wouldn't really be in the document.) + + Character references must also refer to legal XML characters; + #x00110000 is one more than the largest legal character. + + XML Declaration may not be preceded by whitespace. + + XML Declaration may not be preceded by comments or whitespace. + + XML Declaration may not be within a DTD. + + XML declarations may not be within element content. + + XML declarations may not follow document content. + + XML declarations must include the "version=..." string. + + Text declarations may not begin internal parsed entities; + they may only appear at the beginning of external parsed + (parameter or general) entities. + + '<?XML ...?>' is neither an XML declaration + nor a legal processing instruction target name. + + '<?xmL ...?>' is neither an XML declaration + nor a legal processing instruction target name. + + '<?xMl ...?>' is neither an XML declaration + nor a legal processing instruction target name. + + '<?xmL ...?>' is not a legal processing instruction + target name. + + SGML-ism: "#NOTATION gif" can't have attributes. + + Uses '&' unquoted in an entity declaration, + which is illegal syntax for an entity reference. + + Violates the PEs in Internal Subset WFC + by using a PE reference within a declaration. + + Violates the PEs in Internal Subset WFC + by using a PE reference within a declaration. + + Violates the PEs in Internal Subset WFC + by using a PE reference within a declaration. + + Invalid placement of Parameter entity reference. + + Invalid placement of Parameter entity reference. + + Parameter entity declarations must have a space before + the '%'. 
+ + Character FFFF is not legal anywhere in an XML document. + + Character FFFE is not legal anywhere in an XML document. + + An unpaired surrogate (D800) is not legal anywhere + in an XML document. + + An unpaired surrogate (DC00) is not legal anywhere + in an XML document. + + Four byte UTF-8 encodings can encode UCS-4 characters + which are beyond the range of legal XML characters + (and can't be expressed in Unicode surrogate pairs). + This document holds such a character. + + Character FFFF is not legal anywhere in an XML document. + + Character FFFF is not legal anywhere in an XML document. + + Character FFFF is not legal anywhere in an XML document. + + Character FFFF is not legal anywhere in an XML document. + + Character FFFF is not legal anywhere in an XML document. + + Start tags must have matching end tags. + + Character FFFF is not legal anywhere in an XML document. + + Invalid syntax matching double quote is missing. + + Invalid syntax matching double quote is missing. + + The Entity Declared WFC requires entities to be declared + before they are used in an attribute list declaration. + + Internal parsed entities must match the content + production to be well formed. + + Internal parsed entities must match the content + production to be well formed. + + Mixed content declarations may not include content particles. + + In mixed content models, element names must not be + parenthesized. + + Tests the Entity Declared WFC. + Note: a nonvalidating parser is permitted not to report + this WFC violation, since it would need to read an external + parameter entity to distinguish it from a violation of + the Standalone Declaration VC. + + Whitespace is required between attribute/value pairs. + + + + Conditional sections must be properly terminated ("]>" used + instead of "]]>"). + + Processing instruction target names may not be "XML" + in any combination of cases. + + Conditional sections must be properly terminated ("]]>" omitted). 
+ + Conditional sections must be properly terminated ("]]>" omitted). + + Tests the Entity Declared VC by referring to an + undefined parameter entity within an external entity. + + Conditional sections need a '[' after the INCLUDE or IGNORE. + + A <!DOCTYPE ...> declaration may not begin any external + entity; it's only found once, in the document entity. + + In DTDs, the '%' character must be part of a parameter + entity reference. + + + + + Tests the No Recursion WFC by having an external general + entity be self-recursive. + + External entities have "text declarations", which do + not permit the "standalone=..." attribute that's allowed + in XML declarations. + + Only one text declaration is permitted; a second one + looks like an illegal processing instruction (target names + of "xml" in any case are not allowed). + + + + + Tests the "Proper Declaration/PE Nesting" validity constraint by + fragmenting a comment between two parameter entities. + + Tests the "Proper Group/PE Nesting" validity constraint by + fragmenting a content model between two parameter entities. + + Tests the "Proper Declaration/PE Nesting" validity constraint by + fragmenting an element declaration between two parameter entities. + + Tests the "Proper Declaration/PE Nesting" validity constraint by + fragmenting an element declaration between three parameter entities. + + Tests the "Proper Declaration/PE Nesting" validity constraint by + fragmenting an element declaration between two parameter entities. + + Tests the "Proper Declaration/PE Nesting" validity constraint by + fragmenting an element declaration between two parameter entities. + + + + + Test demonstrates an Element Type Declaration with Mixed Content. + + Test demonstrates that whitespace is permitted after the tag name in a Start-tag. + + Test demonstrates that whitespace is permitted after the tag name in an End-tag. + + Test demonstrates a valid attribute specification within a Start-tag. 
+ + Test demonstrates a valid attribute specification within a Start-tag that +contains whitespace on both sides of the equal sign. + + Test demonstrates that the AttValue within a Start-tag can use a single quote as a delimiter. + + Test demonstrates numeric character references can be used for element content. + + Test demonstrates character references can be used for element content. + + Test demonstrates that PubidChar can be used for element content. + + Test demonstrates that whitespace is valid after the Attribute in a Start-tag. + + Test demonstrates multiple Attributes within the Start-tag. + + Uses a legal XML 1.0 name consisting of a single colon + character (disallowed by the latest XML Namespaces draft). + + Test demonstrates that the Attribute in a Start-tag can consist of numerals along with special characters. + + Test demonstrates that all lower case letters are valid for the Attribute in a Start-tag. + + Test demonstrates that all upper case letters are valid for the Attribute in a Start-tag. + + Test demonstrates that Processing Instructions are valid element content. + + Test demonstrates that Processing Instructions are valid element content and there can be more than one. + + Test demonstrates that CDATA sections are valid element content. + + Test demonstrates that CDATA sections are valid element content and that +ampersands may occur in their literal form. + + Test demonstrates that CDATA sections are valid element content and that +everything between the CDStart and CDEnd is recognized as character data not markup. + + Test demonstrates that comments are valid element content. + + Test demonstrates that comments are valid element content and that all characters before the double-hyphen right angle combination are considered part of the comment. + + Test demonstrates that Entity References are valid element content. + + Test demonstrates that Entity References are valid element content and also demonstrates a valid Entity Declaration. 
+ + Test demonstrates an Element Type Declaration and that the contentspec can be of mixed content. + + Test demonstrates an Element Type Declaration and that EMPTY is a valid contentspec. + + Test demonstrates an Element Type Declaration and that ANY is a valid contentspec. + + Test demonstrates a valid prolog that uses double quotes as delimiters around the VersionNum. + + Test demonstrates a valid prolog that uses single quotes as delimiters around the VersionNum. + + Test demonstrates a valid prolog that contains whitespace on both sides of the equal sign in the VersionInfo. + + Test demonstrates a valid EncodingDecl within the prolog. + + Test demonstrates a valid SDDecl within the prolog. + + Test demonstrates that both an EncodingDecl and SDDecl are valid within the prolog. + + Test demonstrates the correct syntax for an Empty element tag. + + Test demonstrates that whitespace is permissible after the name in an Empty element tag. + + Test demonstrates a valid processing instruction. + + Test demonstrates a valid comment and that it may appear anywhere in the document including at the end. + + Test demonstrates a valid comment and that it may appear anywhere in the document including the beginning. + + Test demonstrates a valid processing instruction and that it may appear at the beginning of the document. + + Test demonstrates an Attribute List declaration that uses a StringType as the AttType. + + Test demonstrates an Attribute List declaration that uses a StringType as the AttType and also expands the CDATA attribute with a character reference. + + Test demonstrates an Attribute List declaration that uses a StringType as the AttType and also expands the CDATA attribute with a character reference. The test also shows that the leading zeros in the character reference are ignored. + + An element's attributes may be declared before its content + model; and attribute values may contain newlines. 
+ + Test demonstrates that the empty-element tag must be used for elements that are declared EMPTY. + + Tests whether more than one definition can be provided for the same attribute of a given element type with the first declaration being binding. + + Test demonstrates that when more than one AttlistDecl is provided for a given element type, the contents of all those provided are merged. + + Test demonstrates that extra whitespace is normalized into a single space character. + + Test demonstrates that character data is valid element content. + + Test demonstrates that characters outside of normal ASCII range can be used as element content. + + Test demonstrates that characters outside of normal ASCII range can be used as element content. + + The document is encoded in UTF-16 and uses some name + characters well outside of the normal ASCII range. + + + The document is encoded in UTF-8 and the text inside the + root element uses two non-ASCII characters, encoded in UTF-8 + and each of which expands to a Unicode surrogate pair. + + Tests inclusion of a well-formed internal entity, which + holds an element required by the content model. + + Test demonstrates that extra whitespace within Start-tags and End-tags is normalized into single spaces. + + Test demonstrates that extra whitespace within a processing instruction will be normalized into a single space character. + + Test demonstrates an Attribute List declaration that uses a StringType as the AttType and also expands the CDATA attribute with a character reference. The test also shows that the leading zeros in the character reference are ignored. + + Test demonstrates an element content model whose element can occur zero or more times. + + Test demonstrates that extra whitespace is normalized into a single space character in an attribute of type NMTOKENS. + + Test demonstrates an Element Type Declaration that uses the contentspec of EMPTY. 
The element cannot have any contents and must always appear as an empty element in the document. The test also shows an Attribute-list declaration with multiple AttDef's. + + Test demonstrates the use of decimal Character References within element content. + + Test demonstrates the use of decimal Character References within element content. + + Test demonstrates the use of hexadecimal Character References within element. + + The document is encoded in UTF-8 and the name of the + root element type uses non-ASCII characters. + + Tests in-line handling of two legal character references, which + each expand to a Unicode surrogate pair. + + Tests ability to define an internal entity which can't + legally be expanded (contains an unquoted <). + + Expands a CDATA attribute with a character reference. + + Test demonstrates the use of decimal character references within element content. + + Tests definition of an entity holding a carriage return character + reference, which must be normalized to a newline before being + reported to the application. + + + Verifies that an XML parser will parse a NOTATION + declaration; the output phase of this test ensures that + it's reported to the application. + + Verifies that internal parameter entities are correctly + expanded within the internal subset. + + Test demonstrates that an AttlistDecl can use ID as the TokenizedType within the Attribute type. The test also shows that IMPLIED is a valid DefaultDecl. + + Test demonstrates that an AttlistDecl can use IDREF as the TokenizedType within the Attribute type. The test also shows that IMPLIED is a valid DefaultDecl. + + Test demonstrates that an AttlistDecl can use IDREFS as the TokenizedType within the Attribute type. The test also shows that IMPLIED is a valid DefaultDecl. + + Test demonstrates that an AttlistDecl can use ENTITY as the TokenizedType within the Attribute type. The test also shows that IMPLIED is a valid DefaultDecl. 
+ + Test demonstrates that an AttlistDecl can use ENTITIES as the TokenizedType within the Attribute type. The test also shows that IMPLIED is a valid DefaultDecl. + + + Verifies that an XML parser will parse a NOTATION + attribute; the output phase of this test ensures that + both notations are reported to the application. + + Test demonstrates that an AttlistDecl can use an EnumeratedType within the Attribute type. The test also shows that IMPLIED is a valid DefaultDecl. + + Test demonstrates that an AttlistDecl can use a StringType of CDATA within the Attribute type. The test also shows that REQUIRED is a valid DefaultDecl. + + Test demonstrates that an AttlistDecl can use a StringType of CDATA within the Attribute type. The test also shows that FIXED is a valid DefaultDecl and that a value can be given to the attribute in the Start-tag as well as the AttListDecl. + + Test demonstrates that an AttlistDecl can use a StringType of CDATA within the Attribute type. The test also shows that FIXED is a valid DefaultDecl and that a value can be given to the attribute. + + Test demonstrates the use of the optional character following a name or list to govern the number of times an element or content particles in the list occur. + + Tests that an external PE may be defined (but not referenced). + + Tests that an external PE may be defined (but not referenced). + + Test demonstrates that although whitespace can be used to set apart markup for greater readability it is not necessary. + + Parameter and General entities use different namespaces, + so there can be an entity of each type with a given name. + + Tests whether entities may be declared more than once, + with the first declaration being the binding one. + + Tests whether character references in internal entities are + expanded early enough, by relying on correct handling to + make the entity be well formed. 
+ + Tests whether entity references in internal entities are + expanded late enough, by relying on correct handling to + make the expanded text be valid. (If it's expanded too + early, the entity will parse as an element that's not + valid in that context.) + + Tests entity expansion of three legal character references, + which each expand to a Unicode surrogate pair. + + + Verifies that an XML parser will parse a NOTATION + attribute; the output phase of this test ensures that + the notation is reported to the application. + + + Verifies that an XML parser will parse an ENTITY + attribute; the output phase of this test ensures that + the notation is reported to the application, and for + validating parsers it further tests that the entity + is so reported. + + Test demonstrates that extra whitespace is normalized into a single space character. + + Test demonstrates that extra whitespace is not intended for inclusion in the delivered version of the document. + + + This refers to an undefined parameter entity reference within + a markup declaration in the internal DTD subset, violating + the PEs in Internal Subset WFC. + + Basically an output test, this requires extra whitespace + to be normalized into a single space character in an + attribute of type NMTOKENS. + + Test demonstrates that extra whitespace is normalized into a single space character in an attribute of type NMTOKENS. + + Basically an output test, this tests whether an externally + defined attribute declaration (with a default) takes proper + precedence over a subsequent internal declaration. + + Test demonstrates that extra whitespace within a processing instruction is converted into a single space character. + + Test demonstrates the name of the encoding can be composed of lowercase characters. + + Makes sure that PUBLIC identifiers may have some strange + characters. 
NOTE: The XML editors have said that the XML + specification errata will specify that parameter entity expansion + does not occur in PUBLIC identifiers, so that the '%' character + will not flag a malformed parameter entity reference. + + This tests whether entity expansion is (incorrectly) done + while processing entity declarations; if it is, the entity + value literal will terminate prematurely. + + Test demonstrates that a CDATA attribute can pass a double quote as its value. + + Test demonstrates that an attribute can pass a less than sign as its value. + + Test demonstrates that extra whitespace within an Attribute of a Start-tag is normalized to a single space character. + + Basically an output test, this requires a CDATA attribute + with a tab character to be passed through as one space. + + Basically an output test, this requires a CDATA attribute + with a newline character to be passed through as one space. + + Basically an output test, this requires a CDATA attribute + with a return character to be passed through as one space. + + This tests normalization of end-of-line characters (CRLF) + within entities to LF, primarily as an output test. + + Test demonstrates that an attribute can have a null value. + + Basically an output test, this requires that a CDATA + attribute with a CRLF be normalized to one space. + + Character references expanding to spaces don't affect + treatment of attributes. + + Test demonstrates the use of content particles within the element content. + + Test demonstrates that it is not an error to have attributes declared for an element not itself declared. + + Test demonstrates that all text within a valid CDATA section is considered text and not recognized as markup. + + Test demonstrates that an entity reference is processed by recursively processing the replacement text of the entity. + + Test demonstrates that a line break within CDATA will be normalized. 
+ + Test demonstrates that entity expansion is done while processing entity declarations. + + Test demonstrates that entity expansion is done while processing entity declarations. + + Comments may contain any legal XML characters; + only the string "--" is disallowed. + + + + + Test demonstrates the use of an ExternalID within a document type definition. + + Test demonstrates the use of an ExternalID within a document type definition. + + Test demonstrates the expansion of an external parameter entity that declares an attribute. + + Expands an external parameter entity in two different ways, + with one of them declaring an attribute. + + Test demonstrates the expansion of an external parameter entity that declares an attribute. + + Test demonstrates that when more than one definition is provided for the same attribute of a given element type only the first declaration is binding. + + Test demonstrates the use of an Attribute list declaration within an external entity. + + Test demonstrates that an external identifier may include a public identifier. + + Test demonstrates that an external identifier may include a public identifier. + + Test demonstrates that when more than one definition is provided for the same attribute of a given element type only the first declaration is binding. + + Test demonstrates a parameter entity declaration whose parameter entity definition is an ExternalID. + + Test demonstrates an external parsed entity that begins with a text declaration. + + Test demonstrates the use of the conditional section INCLUDE that will include its contents as part of the DTD. + + Test demonstrates the use of the conditional section INCLUDE that will include its contents as part of the DTD. The keyword is a parameter-entity reference. + + Test demonstrates the use of the conditional section IGNORE that will ignore its content from being part of the DTD. The keyword is a parameter-entity reference. 
+ + Test demonstrates the use of the conditional section INCLUDE that will include its contents as part of the DTD. The keyword is a parameter-entity reference. + + Test demonstrates a parameter entity declaration that contains an attribute list declaration. + + Test demonstrates an ExternalID whose contents contain a parameter entity declaration and an attribute list definition. + + Test demonstrates that a parameter entity will be expanded with spaces on either side. + + Parameter entities expand with spaces on either side. + + Test demonstrates a parameter entity declaration that contains a partial attribute list declaration. + + Test demonstrates the use of a parameter-entity reference as a keyword of a conditional section. The parameter entity must be replaced by its content before the processor decides whether to include the conditional section. + + Test demonstrates the use of a parameter entity reference within an attribute list declaration. + + + Constructs an <!ATTLIST...> declaration from several PEs. + + Test demonstrates that when more than one definition is provided for the same entity only the first declaration is binding. + + Test demonstrates that when more than one definition is provided for the same attribute of a given element type only the first declaration is binding. + + Test demonstrates a parameter entity reference whose value is NULL. + + Test demonstrates the use of the conditional section INCLUDE that will include its contents. + + Test demonstrates the use of the conditional section IGNORE that will ignore its content from being used. + + Test demonstrates the use of the conditional section IGNORE that will ignore its content from being used. + + Expands a general entity which contains a CDATA section with + what looks like a markup declaration (but is just text since + it's in a CDATA section). + + + + + A combination of carriage return line feed in an external entity must + be normalized to a single newline. 
+ + A carriage return (also CRLF) in an external entity must + be normalized to a single newline. + + Test demonstrates that the content of an element can be empty. In this case the external entity is an empty file. + + A carriage return (also CRLF) in an external entity must + be normalized to a single newline. + + Test demonstrates the use of optional character and content particles within an element content. The test also shows the use of an external entity. + + Test demonstrates the use of optional character and content particles within mixed element content. The test also shows the use of an external entity and that a carriage control line feed in an external entity must be normalized to a single newline. + + Test demonstrates the use of external entity and how replacement +text is retrieved and processed. + Test demonstrates the use of external +entity and how replacement text is retrieved and processed. Also tests the use of an +EncodingDecl of UTF-16. + + A carriage return (also CRLF) in an external entity must + be normalized to a single newline. + + Test demonstrates the use of a public identifier with an external entity. +The test also shows that a carriage control line feed combination in an external +entity must be normalized to a single newline. + + Test demonstrates both internal and external entities and that processing of entity references may be required to produce the correct replacement text. + + Test demonstrates that whitespace is handled by adding a single whitespace to the normalized value in the attribute list. + + Test demonstrates use of characters outside of normal ASCII range. +