Commit Graph

67 Commits

Author SHA1 Message Date
Waylan Limberg 53383e90e4
gh-86155: Fix data loss after unclosed script or style tag in HTMLParser (GH-22658)
When calling .close() the HTMLParser should flush all remaining content,
even when that content is in an unclosed script or style tag.
2025-05-10 17:36:06 +00:00
Ezio Melotti 76c0b01bc4
gh-77057: Fix handling of invalid markup declarations in HTMLParser (GH-9295)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
2025-05-10 17:31:43 +03:00
Sascha Ißbrücker 77b14a6d58
gh-69426: HTMLParser: only unescape properly terminated character entities in attribute values (GH-95215)
According to the HTML5 spec, named character references in attribute values
should only be processed if they are not followed by an ASCII alphanumeric,
or an equals sign.

https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
2025-05-07 18:49:49 +03:00
Dong-hee Na 157aef79b0
gh-95813: Improve HTMLParser from the view of inheritance (#95874)
* gh-95813: Improve HTMLParser from the view of inheritance

* gh-95813: Add unittest

* Address code review
2022-08-18 13:16:33 +02:00
Karl Dubost 9eb11a139f
bpo-41748: Handles unquoted attributes with commas (#24072)
* bpo-41748: Adds tests for unquoted attributes with comma

* bpo-41748: Handles unquoted attributes with comma

* bpo-41748: Addresses review comments

* bpo-41748: Addresses review comments

* Adds more test cases
* Simplifies the regex for handling spaces

* bpo-41748: Moves attributes tests under the right class

* bpo-41748: Addresses review about duplicate attributes

* bpo-41748: Adds NEWS.d entry for this patch
2021-02-01 21:32:50 +01:00
Inada Naoki fae0ed5099
bpo-37328: remove deprecated HTMLParser.unescape (GH-14186)
It is deprecated since Python 3.4.
2019-08-27 11:48:06 +09:00
R David Murray 44b548dda8 #27364: fix "incorrect" uses of escape character in the stdlib.
And most of the tools.

Patch by Emanual Barry, reviewed by me, Serhiy Storchaka, and
Martin Panter.
2016-09-08 13:59:53 -04:00
Serhiy Storchaka 597d15afe4 Issue #23277: Remove unused support.run_unittest import. 2016-04-24 13:45:58 +03:00
Ezio Melotti 20a2c6482e #23144: merge with 3.4. 2015-09-06 21:44:45 +03:00
Ezio Melotti 6f2bb98966 #23144: Make sure that HTMLParser.feed() returns all the data, even when convert_charrefs is True. 2015-09-06 21:38:06 +03:00
Ezio Melotti 6fc16d81af #21047: set the default value for the *convert_charrefs* argument of HTMLParser to True. Patch by Berker Peksag. 2014-08-02 18:36:12 +03:00
Ezio Melotti 73a4359eb0 #15114: the strict mode and argument of HTMLParser, HTMLParser.error, and the HTMLParserError exception have been removed. 2014-08-02 14:10:30 +03:00
Ezio Melotti 153d97b24e #20288: merge with 3.3. 2014-02-01 21:22:26 +02:00
Ezio Melotti f27b9a741a #20288: fix handling of invalid numeric charrefs in HTMLParser. 2014-02-01 21:21:01 +02:00
Ezio Melotti 95401c5f6b #13633: Added a new convert_charrefs keyword arg to HTMLParser that, when True, automatically converts all character references. 2013-11-23 19:52:05 +02:00
Ezio Melotti f6de9eb2bb #19688: add back and deprecate the internal HTMLParser.unescape() method. 2013-11-22 05:49:29 +02:00
Ezio Melotti 4a9ee26750 #2927: Added the unescape() function to the html module. 2013-11-19 20:28:45 +02:00
Ezio Melotti b7038817fe #19480: merge with 3.3. 2013-11-07 18:35:27 +02:00
Ezio Melotti 7165d8b9ba #19480: HTMLParser now accepts all valid start-tag names as defined by the HTML5 standard. 2013-11-07 18:33:24 +02:00
Ezio Melotti 1943c8a112 Merge test_htmlparser changes from 3.3. 2013-11-02 17:50:02 +02:00
Ezio Melotti 5028f4d461 Use unittest.main() in test_htmlparser. 2013-11-02 17:49:08 +02:00
Ezio Melotti 88ebfb129b #15114: The html.parser module now raises a DeprecationWarning when the strict argument of HTMLParser or the HTMLParser.error method are used. 2013-11-02 17:08:24 +02:00
Ezio Melotti 8e596a765c #17802: Fix an UnboundLocalError in html.parser. Initial tests by Thomas Barlow. 2013-05-01 16:18:25 +03:00
Ezio Melotti 46495182d0 #15156: HTMLParser now uses the new "html.entities.html5" dictionary. 2012-06-24 22:02:56 +02:00
Ezio Melotti 3861d8b271 #15114: the strict mode of HTMLParser and the HTMLParseError exception are deprecated now that the parser is able to parse invalid markup. 2012-06-23 15:27:51 +02:00
Ezio Melotti 0780b6bc58 #14538: HTMLParser can now parse correctly start tags that contain a bare /. 2012-04-18 19:18:22 -06:00
Ezio Melotti 29877e8e04 HTMLParser is now able to handle slashes in the start tag. 2012-02-21 09:25:00 +02:00
Ezio Melotti e31ddedb0e Fix an index and clean up comments. 2012-02-13 20:20:00 +02:00
Ezio Melotti f4ab491901 Improve handling of declarations in HTMLParser. 2012-02-13 15:50:37 +02:00
Ezio Melotti 86f67123be Fix htmlparser tests to always use the right collector. 2012-02-13 14:11:27 +02:00
Ezio Melotti 5211ffe4df #13993: HTMLParser is now able to handle broken end tags when strict=False. 2012-02-13 11:24:50 +02:00
Ezio Melotti fa3702dc28 #13960: HTMLParser is now able to handle broken comments when strict=False. 2012-02-10 10:45:44 +02:00
Ezio Melotti 62f3d0300e #13576: add tests about the handling of (possibly broken) condcoms. 2011-12-19 07:29:03 +02:00
Ezio Melotti 15cb489234 #13358: HTMLParser now calls handle_data only once for each CDATA. 2011-11-18 18:01:49 +02:00
Ezio Melotti c2fe57762b #1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser. 2011-11-14 18:53:33 +02:00
Ezio Melotti b245ed1cdf Group tests about attributes in a separate class. 2011-11-14 18:13:22 +02:00
Ezio Melotti c1e73c30e9 Make sure that the tolerant parser still parses valid HTML correctly. 2011-11-01 18:57:15 +02:00
Ezio Melotti b9a48f7144 Avoid reusing the same collector in the tests. 2011-11-01 15:00:59 +02:00
Ezio Melotti 18b0e5b79b #12008: add a test. 2011-11-01 14:42:54 +02:00
Ezio Melotti 7de56f6a04 #670664: Fix HTMLParser to correctly handle the content of ``<script>...</script>`` and ``<style>...</style>``. 2011-11-01 14:12:22 +02:00
Ezio Melotti f50ffa94ab #13273: fix a bug that prevented HTMLParser to properly detect some tags when strict=False. 2011-10-28 13:21:09 +03:00
Ezio Melotti d9e0b068af #12888: Fix a bug in HTMLParser.unescape that prevented it to escape more than 128 entities. Patch by Peter Otten. 2011-09-05 17:11:06 +03:00
Ezio Melotti 2e3607c1e7 #7311: fix html.parser to accept non-ASCII attribute values. 2011-04-07 22:03:31 +03:00
Senthil Kumaran 164540fee1 Fix Issue10759 - html.parser.unescape() fails on HTML entities with incorrect syntax 2010-12-28 15:55:16 +00:00
R. David Murray b579dba119 #1486713: Add a tolerant mode to HTMLParser.
The motivation for adding this option is that the the functionality it
provides used to be provided by sgmllib in Python2, and was used by,
for example, BeautifulSoup.  Without this option, the Python3 version
of BeautifulSoup and the many programs that use it are crippled.

The original patch was by 'kxroberto'.  I modified it heavily but kept his
heuristics and test.  I also added additional heuristics to fix #975556,
#1046092, and part of #6191.  This patch should be completely backward
compatible:  the behavior with the default strict=True is unchanged.
2010-12-03 04:06:39 +00:00
Victor Stinner e021f4b206 Recorded merge of revisions 81500-81501 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r81500 | victor.stinner | 2010-05-24 23:33:24 +0200 (lun., 24 mai 2010) | 2 lines

  Issue #6662: Fix parsing of malformatted charref (&#bad;)
........
  r81501 | victor.stinner | 2010-05-24 23:37:28 +0200 (lun., 24 mai 2010) | 2 lines

  Add the author of the last fix (Issue #6662)
........
2010-05-24 21:46:25 +00:00
Benjamin Peterson 5a53fdeee8 Merged revisions 78678,78680,78682 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r78678 | benjamin.peterson | 2010-03-04 21:07:59 -0600 (Thu, 04 Mar 2010) | 1 line

  set svn:eol-style
........
  r78680 | benjamin.peterson | 2010-03-04 21:15:07 -0600 (Thu, 04 Mar 2010) | 1 line

  set svn:eol-style on Lib files
........
  r78682 | benjamin.peterson | 2010-03-04 21:20:06 -0600 (Thu, 04 Mar 2010) | 1 line

  remove the svn:executable property from files that don't have shebang lines
........
2010-03-05 03:33:11 +00:00
Mark Dickinson f64dcf3ce0 Change test_htmlparser to reflect the HTMLParser -> html.parser
rename in r63439.

Also fix one occurrence of unichr() in html.parser.
2008-05-21 13:51:18 +00:00
Benjamin Peterson ee8712cda4 #2621 rename test.test_support to test.support 2008-05-20 21:35:26 +00:00
Christian Heimes 05e8be17fd Merged revisions 60990-61002 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r60990 | eric.smith | 2008-02-23 17:05:26 +0100 (Sat, 23 Feb 2008) | 1 line

  Removed duplicate Py_CHARMASK define.  It's already defined in Python.h.
........
  r60991 | andrew.kuchling | 2008-02-23 17:23:05 +0100 (Sat, 23 Feb 2008) | 4 lines

  #1330538: Improve comparison of xmlrpclib.DateTime and datetime instances.
  Remove automatic handling of datetime.date and datetime.time.
  This breaks backward compatibility, but python-dev discussion was strongly
  against this automatic conversion; see the bug for a link.
........
  r60994 | andrew.kuchling | 2008-02-23 17:39:43 +0100 (Sat, 23 Feb 2008) | 1 line

  #835521: Add index entries for various pickle-protocol methods and attributes
........
  r60995 | andrew.kuchling | 2008-02-23 18:10:46 +0100 (Sat, 23 Feb 2008) | 2 lines

  #1433694: minidom's .normalize() failed to set .nextSibling for last element.
  Fix by Malte Helmert
........
  r61000 | christian.heimes | 2008-02-23 18:40:11 +0100 (Sat, 23 Feb 2008) | 1 line

  Patch #2167 from calvin: Remove unused imports
........
  r61001 | christian.heimes | 2008-02-23 18:42:31 +0100 (Sat, 23 Feb 2008) | 1 line

  Patch #1957: syslogmodule: Release GIL when calling syslog(3)
........
  r61002 | christian.heimes | 2008-02-23 18:52:07 +0100 (Sat, 23 Feb 2008) | 2 lines

  Issue #2051 and patch from Alexander Belopolsky:
  Permission for pyc and pyo files are inherited from the py file.
........
2008-02-23 18:30:17 +00:00