[3.11] GH-107678: Improve Unicode handling clarity in ``library/re.rst`` (GH-107679) (#113966)

GH-107678: Improve Unicode handling clarity in ``library/re.rst`` (GH-107679)
(cherry picked from commit c9b8a22f34)

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
2024-01-12 01:02:35 +01:00 · 535b1dedb3
parent 71288ed8b0
commit 535b1dedb3
1 changed file with 144 additions and 91 deletions
--- a/Doc/library/re.rst
+++ b/Doc/library/re.rst
@@ -17,7 +17,7 @@ those found in Perl.
 Both patterns and strings to be searched can be Unicode strings (:class:`str`)
 as well as 8-bit strings (:class:`bytes`).
 However, Unicode strings and 8-bit strings cannot be mixed:
-that is, you cannot match a Unicode string with a byte pattern or
+that is, you cannot match a Unicode string with a bytes pattern or
 vice-versa; similarly, when asking for a substitution, the replacement
 string must be of the same type as both the pattern and the search string.

@@ -257,8 +257,7 @@ The special characters are:
   .. index:: single: \ (backslash); in regular expressions

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
-     inside a set, although the characters they match depends on whether
-     :const:`ASCII` or :const:`LOCALE` mode is in force.
+     inside a set, although the characters they match depend on the flags_ used.

   .. index:: single: ^ (caret); in regular expressions

@@ -326,18 +325,24 @@ The special characters are:
   currently supported extensions.

 ``(?aiLmsux)``
-   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
-   ``'s'``, ``'u'``, ``'x'``.)  The group matches the empty string; the
-   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
-   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
-   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
-   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
-   for the entire regular expression.
+   (One or more letters from the set
+   ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``.)
+   The group matches the empty string;
+   the letters set the corresponding flags for the entire regular expression:
+
+   * :const:`re.A` (ASCII-only matching)
+   * :const:`re.I` (ignore case)
+   * :const:`re.L` (locale dependent)
+   * :const:`re.M` (multi-line)
+   * :const:`re.S` (dot matches all)
+   * :const:`re.U` (Unicode matching)
+   * :const:`re.X` (verbose)
+
   (The flags are described in :ref:`contents-of-module-re`.)
   This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flag* argument to the
-   :func:`re.compile` function.  Flags should be used first in the
-   expression string.
+   :func:`re.compile` function.
+   Flags should be used first in the expression string.

   .. versionchanged:: 3.11
      This construction can only be used at the start of the expression.
@@ -351,14 +356,20 @@ The special characters are:
   pattern.

 ``(?aiLmsux-imsx:...)``
-   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
-   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
+   (Zero or more letters from the set
+   ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``,
+   optionally followed by ``'-'`` followed by
   one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
-   The letters set or remove the corresponding flags:
-   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
-   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
-   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
-   and :const:`re.X` (verbose), for the part of the expression.
+   The letters set or remove the corresponding flags for the part of the expression:
+
+   * :const:`re.A` (ASCII-only matching)
+   * :const:`re.I` (ignore case)
+   * :const:`re.L` (locale dependent)
+   * :const:`re.M` (multi-line)
+   * :const:`re.S` (dot matches all)
+   * :const:`re.U` (Unicode matching)
+   * :const:`re.X` (verbose)
+
   (The flags are described in :ref:`contents-of-module-re`.)

   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
@@ -366,7 +377,7 @@ The special characters are:
   when one of them appears in an inline group, it overrides the matching mode
   in the enclosing group.  In Unicode patterns ``(?a:...)`` switches to
   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
-   (default).  In byte pattern ``(?L:...)`` switches to locale depending
+   (default).  In bytes patterns ``(?L:...)`` switches to locale dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
   This override is only in effect for the narrow inline group, and the
   original matching mode is restored outside of the group.
@@ -528,47 +539,61 @@ character ``'$'``.

 ``\b``
   Matches the empty string, but only at the beginning or end of a word.
-   A word is defined as a sequence of word characters.  Note that formally,
-   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
-   (or vice versa), or between ``\w`` and the beginning/end of the string.
-   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
-   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
+   A word is defined as a sequence of word characters.
+   Note that formally, ``\b`` is defined as the boundary
+   between a ``\w`` and a ``\W`` character (or vice versa),
+   or between ``\w`` and the beginning or end of the string.
+   This means that ``r'\bat\b'`` matches ``'at'``, ``'at.'``, ``'(at)'``,
+   and ``'as at ay'`` but not ``'attempt'`` or ``'atlas'``.

-   By default Unicode alphanumerics are the ones used in Unicode patterns, but
-   this can be changed by using the :const:`ASCII` flag.  Word boundaries are
-   determined by the current locale if the :const:`LOCALE` flag is used.
-   Inside a character range, ``\b`` represents the backspace character, for
-   compatibility with Python's string literals.
+   The default word characters in Unicode (str) patterns
+   are Unicode alphanumerics and the underscore,
+   but this can be changed by using the :py:const:`~re.ASCII` flag.
+   Word boundaries are determined by the current locale
+   if the :py:const:`~re.LOCALE` flag is used.
+
+   .. note::
+
+      Inside a character range, ``\b`` represents the backspace character,
+      for compatibility with Python's string literals.

 .. index:: single: \B; in regular expressions

 ``\B``
-   Matches the empty string, but only when it is *not* at the beginning or end
-   of a word.  This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
-   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
-   ``\B`` is just the opposite of ``\b``, so word characters in Unicode
-   patterns are Unicode alphanumerics or the underscore, although this can
-   be changed by using the :const:`ASCII` flag.  Word boundaries are
-   determined by the current locale if the :const:`LOCALE` flag is used.
+   Matches the empty string,
+   but only when it is *not* at the beginning or end of a word.
+   This means that ``r'at\B'`` matches ``'athens'``, ``'atom'``,
+   ``'attorney'``, but not ``'at'``, ``'at.'``, or ``'at!'``.
+   ``\B`` is the opposite of ``\b``,
+   so word characters in Unicode (str) patterns
+   are Unicode alphanumerics or the underscore,
+   although this can be changed by using the :py:const:`~re.ASCII` flag.
+   Word boundaries are determined by the current locale
+   if the :py:const:`~re.LOCALE` flag is used.

 .. index:: single: \d; in regular expressions

 ``\d``
   For Unicode (str) patterns:
-      Matches any Unicode decimal digit (that is, any character in
-      Unicode character category [Nd]).  This includes ``[0-9]``, and
-      also many other digit characters.  If the :const:`ASCII` flag is
-      used only ``[0-9]`` is matched.
+      Matches any Unicode decimal digit
+      (that is, any character in Unicode character category `[Nd]`__).
+      This includes ``[0-9]``, and also many other digit characters.
+
+      Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.
+
+      __ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153

   For 8-bit (bytes) patterns:
-      Matches any decimal digit; this is equivalent to ``[0-9]``.
+      Matches any decimal digit in the ASCII character set;
+      this is equivalent to ``[0-9]``.

 .. index:: single: \D; in regular expressions

 ``\D``
-   Matches any character which is not a decimal digit. This is
-   the opposite of ``\d``. If the :const:`ASCII` flag is used this
-   becomes the equivalent of ``[^0-9]``.
+   Matches any character which is not a decimal digit.
+   This is the opposite of ``\d``.
+
+   Matches ``[^0-9]`` if the :py:const:`~re.ASCII` flag is used.

 .. index:: single: \s; in regular expressions

@@ -577,8 +602,9 @@ character ``'$'``.
      Matches Unicode whitespace characters (which includes
      ``[ \t\n\r\f\v]``, and also many other characters, for example the
      non-breaking spaces mandated by typography rules in many
-      languages). If the :const:`ASCII` flag is used, only
-      ``[ \t\n\r\f\v]`` is matched.
+      languages).
+
+      Matches ``[ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.

   For 8-bit (bytes) patterns:
      Matches characters considered whitespace in the ASCII character set;
@@ -588,30 +614,39 @@ character ``'$'``.

 ``\S``
   Matches any character which is not a whitespace character. This is
-   the opposite of ``\s``. If the :const:`ASCII` flag is used this
-   becomes the equivalent of ``[^ \t\n\r\f\v]``.
+   the opposite of ``\s``.
+
+   Matches ``[^ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.

 .. index:: single: \w; in regular expressions

 ``\w``
   For Unicode (str) patterns:
-      Matches Unicode word characters; this includes alphanumeric characters (as defined by :meth:`str.isalnum`)
+      Matches Unicode word characters;
+      this includes all Unicode alphanumeric characters
+      (as defined by :py:meth:`str.isalnum`),
      as well as the underscore (``_``).
-      If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched.
+
+      Matches ``[a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.

   For 8-bit (bytes) patterns:
      Matches characters considered alphanumeric in the ASCII character set;
-      this is equivalent to ``[a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
-      used, matches characters considered alphanumeric in the current locale
-      and the underscore.
+      this is equivalent to ``[a-zA-Z0-9_]``.
+      If the :py:const:`~re.LOCALE` flag is used,
+      matches characters considered alphanumeric in the current locale and the underscore.

 .. index:: single: \W; in regular expressions

 ``\W``
-   Matches any character which is not a word character. This is
-   the opposite of ``\w``. If the :const:`ASCII` flag is used this
-   becomes the equivalent of ``[^a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
-   used, matches characters which are neither alphanumeric in the current locale
+   Matches any character which is not a word character.
+   This is the opposite of ``\w``.
+   By default, matches non-underscore (``_``) characters
+   for which :py:meth:`str.isalnum` returns ``False``.
+
+   Matches ``[^a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
+
+   If the :py:const:`~re.LOCALE` flag is used,
+   matches characters which are neither alphanumeric in the current locale
   nor the underscore.

 .. index:: single: \Z; in regular expressions
@@ -643,9 +678,11 @@ accepted by the regular expression parser::
 (Note that ``\b`` is used to represent word boundaries, and means "backspace"
 only inside character classes.)

-``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
-patterns.  In bytes patterns they are errors.  Unknown escapes of ASCII
-letters are reserved for future use and treated as errors.
+``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are
+only recognized in Unicode (str) patterns.
+In bytes patterns they are errors.
+Unknown escapes of ASCII letters are reserved
+for future use and treated as errors.

 Octal escapes are included in a limited form.  If the first digit is a 0, or if
 there are three octal digits, it is considered an octal escape. Otherwise, it is
@@ -693,30 +730,37 @@ Flags

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   perform ASCII-only matching instead of full Unicode matching.  This is only
-   meaningful for Unicode patterns, and is ignored for byte patterns.
+   meaningful for Unicode (str) patterns, and is ignored for bytes patterns.
+
   Corresponds to the inline flag ``(?a)``.

-   Note that for backward compatibility, the :const:`re.U` flag still
-   exists (as well as its synonym :const:`re.UNICODE` and its embedded
-   counterpart ``(?u)``), but these are redundant in Python 3 since
-   matches are Unicode by default for strings (and Unicode matching
-   isn't allowed for bytes).
+   .. note::
+
+      The :py:const:`~re.U` flag still exists for backward compatibility,
+      but is redundant in Python 3 since
+      matches are Unicode by default for ``str`` patterns,
+      and Unicode matching isn't allowed for bytes patterns.
+      :py:const:`~re.UNICODE` and the inline flag ``(?u)`` are similarly redundant.


 .. data:: DEBUG

   Display debug information about compiled expression.
+
   No corresponding inline flag.


 .. data:: I
          IGNORECASE

-   Perform case-insensitive matching; expressions like ``[A-Z]`` will also
-   match lowercase letters.  Full Unicode matching (such as ``Ü`` matching
-   ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
-   non-ASCII matches.  The current locale does not change the effect of this
-   flag unless the :const:`re.LOCALE` flag is also used.
+   Perform case-insensitive matching;
+   expressions like ``[A-Z]`` will also match lowercase letters.
+   Full Unicode matching (such as ``Ü`` matching ``ü``)
+   also works unless the :py:const:`~re.ASCII` flag
+   is used to disable non-ASCII matches.
+   The current locale does not change the effect of this flag
+   unless the :py:const:`~re.LOCALE` flag is also used.
+
   Corresponds to the inline flag ``(?i)``.

   Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
@@ -724,29 +768,35 @@ Flags
   letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
   letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
   'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
-   If the :const:`ASCII` flag is used, only letters 'a' to 'z'
+   If the :py:const:`~re.ASCII` flag is used, only letters 'a' to 'z'
   and 'A' to 'Z' are matched.

 .. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
-   dependent on the current locale.  This flag can be used only with bytes
-   patterns.  The use of this flag is discouraged as the locale mechanism
-   is very unreliable, it only handles one "culture" at a time, and it only
-   works with 8-bit locales.  Unicode matching is already enabled by default
-   in Python 3 for Unicode (str) patterns, and it is able to handle different
-   locales/languages.
+   dependent on the current locale.
+   This flag can be used only with bytes patterns.
+
   Corresponds to the inline flag ``(?L)``.

+   .. warning::
+
+      This flag is discouraged; consider Unicode matching instead.
+      The locale mechanism is very unreliable
+      as it only handles one "culture" at a time
+      and only works with 8-bit locales.
+      Unicode matching is enabled by default for Unicode (str) patterns
+      and it is able to handle different locales and languages.
+
   .. versionchanged:: 3.6
-      :const:`re.LOCALE` can be used only with bytes patterns and is
-      not compatible with :const:`re.ASCII`.
+      :py:const:`~re.LOCALE` can be used only with bytes patterns
+      and is not compatible with :py:const:`~re.ASCII`.

   .. versionchanged:: 3.7
-      Compiled regular expression objects with the :const:`re.LOCALE` flag no
-      longer depend on the locale at compile time.  Only the locale at
-      matching time affects the result of matching.
+      Compiled regular expression objects with the :py:const:`~re.LOCALE` flag
+      no longer depend on the locale at compile time.
+      Only the locale at matching time affects the result of matching.


 .. data:: M
@@ -758,6 +808,7 @@ Flags
   end of each line (immediately preceding each newline).  By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.
+
   Corresponds to the inline flag ``(?m)``.

 .. data:: NOFLAG
@@ -777,19 +828,19 @@ Flags

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.
+
   Corresponds to the inline flag ``(?s)``.


 .. data:: U
          UNICODE

-   In Python 2, this flag made :ref:`special sequences <re-special-sequences>`
-   include Unicode characters in matches. Since Python 3, Unicode characters
-   are matched by default.
+   In Python 3, Unicode characters are matched by default
+   for ``str`` patterns.
+   This flag is therefore redundant with **no effect**
+   and is only kept for backward compatibility.

-   See :const:`A` for restricting matching on ASCII characters instead.
-
-   This flag is only kept for backward compatibility.
+   See :py:const:`~re.ASCII` to restrict matching to ASCII characters instead.

 .. data:: X
          VERBOSE
@@ -913,6 +964,8 @@ Functions
   Empty matches for the pattern split the string only when not adjacent
   to a previous empty match.

+   .. code:: pycon
+
      >>> re.split(r'\b', 'Words, words, words.')
      ['', 'Words', ', ', 'words', ', ', 'words', '.']
      >>> re.split(r'\W*', '...words...')
@@ -1229,7 +1282,7 @@ Regular Expression Objects

   The regex matching flags.  This is a combination of the flags given to
   :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
-   flags such as :data:`UNICODE` if the pattern is a Unicode string.
+   flags such as :py:const:`~re.UNICODE` if the pattern is a Unicode string.


 .. attribute:: Pattern.groups
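
The behaviour described by the reworded documentation can be checked with a short script. This is a minimal sketch rather than part of the commit: the sample strings and variable names are chosen here for illustration, and the `raises` helper is hypothetical. It exercises only standard `re` calls to show the default Unicode matching for `str` patterns, the effect of `re.ASCII` and the inline `(?a:...)` group, ASCII-only matching for bytes patterns, the redundancy of `re.U`, and the errors raised when types or flags are mixed.

```python
import re

# \d in a str pattern matches any Unicode decimal digit (category Nd) by default;
# with re.ASCII it matches only [0-9].
ARABIC_INDIC_THREE = "\u0663"
digit_default = re.fullmatch(r"\d", ARABIC_INDIC_THREE) is not None
digit_ascii = re.fullmatch(r"\d", ARABIC_INDIC_THREE, re.ASCII) is not None

# \w in a str pattern matches Unicode alphanumerics and the underscore by default;
# re.ASCII (or the inline group (?a:...)) restricts it to [a-zA-Z0-9_].
SAMPLE = "naïve café_1"  # illustrative sample text
words_default = re.findall(r"\w+", SAMPLE)
words_ascii = re.findall(r"\w+", SAMPLE, re.ASCII)
words_inline = re.findall(r"(?a:\w+)", SAMPLE)

# Bytes patterns always use ASCII-only matching unless re.LOCALE is given.
words_bytes = re.findall(rb"\w+", SAMPLE.encode("utf-8"))

# \b marks the boundary between \w and \W (or the start/end of the string).
boundary_hits = re.findall(r"\bat\b", "at. (at) as at ay attempt atlas")

# re.U adds nothing for str patterns: the compiled flags are identical.
unicode_flag_is_noop = re.compile(r"\w", re.U).flags == re.compile(r"\w").flags


def raises(exc_type, func, *args):
    """Hypothetical helper: report whether func(*args) raises exc_type."""
    try:
        func(*args)
    except exc_type:
        return True
    return False


# str and bytes cannot be mixed, LOCALE is bytes-only, UNICODE is str-only.
mixed_types_rejected = raises(TypeError, re.search, rb"\w", "text")
locale_on_str_rejected = raises(ValueError, re.compile, r"\w", re.LOCALE)
unicode_on_bytes_rejected = raises(ValueError, re.compile, rb"\w", re.UNICODE)

print(digit_default, digit_ascii)
print(words_default, words_ascii, words_bytes)
print(boundary_hits, unicode_flag_is_noop)
```

The inline form `(?a:...)` is used in the sketch because it scopes ASCII-only matching to one group, which is the case the reworded `(?aiLmsux-imsx:...)` entry describes.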