[3.11] GH-107678: Improve Unicode handling clarity in ``library/re.rst`` (GH-107679) (#113966)

GH-107678: Improve Unicode handling clarity in ``library/re.rst`` (GH-107679)
(cherry picked from commit c9b8a22f34)

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Miss Islington (bot) 2024-01-12 01:02:35 +01:00 committed by GitHub
parent 71288ed8b0
commit 535b1dedb3
1 changed files with 144 additions and 91 deletions

@ -17,7 +17,7 @@ those found in Perl.
Both patterns and strings to be searched can be Unicode strings (:class:`str`)
as well as 8-bit strings (:class:`bytes`).
However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a byte pattern or
that is, you cannot match a Unicode string with a bytes pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.
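The type rule above can be checked with a short sketch. The sample strings and variable names are made up for illustration; only ``re.search`` and ``re.sub`` come from the module itself.

```python
import re

# A str pattern searches a str; a bytes pattern searches bytes.
str_match = re.search(r'b+', 'abbc')
bytes_match = re.search(rb'b+', b'abbc')

# Mixing the two types is rejected with TypeError.
try:
    re.search(rb'b+', 'abbc')
    mixed_ok = True
except TypeError:
    mixed_ok = False

# For substitutions, the replacement has to share the type as well.
replaced = re.sub(rb'b+', b'-', b'abbc')
```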
@ -257,8 +257,7 @@ The special characters are:
.. index:: single: \ (backslash); in regular expressions
* Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
inside a set, although the characters they match depends on whether
:const:`ASCII` or :const:`LOCALE` mode is in force.
inside a set, although the characters they match depend on the flags_ used.
.. index:: single: ^ (caret); in regular expressions
@ -326,18 +325,24 @@ The special characters are:
currently supported extensions.
``(?aiLmsux)``
(One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
:const:`re.I` (ignore case), :const:`re.L` (locale dependent),
:const:`re.M` (multi-line), :const:`re.S` (dot matches all),
:const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
for the entire regular expression.
(One or more letters from the set
``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``.)
The group matches the empty string;
the letters set the corresponding flags for the entire regular expression:
* :const:`re.A` (ASCII-only matching)
* :const:`re.I` (ignore case)
* :const:`re.L` (locale dependent)
* :const:`re.M` (multi-line)
* :const:`re.S` (dot matches all)
* :const:`re.U` (Unicode matching)
* :const:`re.X` (verbose)
(The flags are described in :ref:`contents-of-module-re`.)
This is useful if you wish to include the flags as part of the
regular expression, instead of passing a *flag* argument to the
:func:`re.compile` function. Flags should be used first in the
expression string.
:func:`re.compile` function.
Flags should be used first in the expression string.
.. versionchanged:: 3.11
This construction can only be used at the start of the expression.
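As a rough illustration of the point above, an inline flag at the start of the expression should behave like the corresponding *flags* argument. The pattern ``spam`` and the variable names are hypothetical.

```python
import re

# The same case-insensitive pattern, written two ways.
inline = re.compile(r'(?i)spam')
by_argument = re.compile(r'spam', re.IGNORECASE)

inline_hit = inline.fullmatch('SPAM')
argument_hit = by_argument.fullmatch('SPAM')

# Both compiled patterns report the same combined flags.
same_flags = inline.flags == by_argument.flags
```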
@ -351,14 +356,20 @@ The special characters are:
pattern.
``(?aiLmsux-imsx:...)``
(Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
(Zero or more letters from the set
``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``,
optionally followed by ``'-'`` followed by
one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
The letters set or remove the corresponding flags:
:const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
:const:`re.L` (locale dependent), :const:`re.M` (multi-line),
:const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
and :const:`re.X` (verbose), for the part of the expression.
The letters set or remove the corresponding flags for the part of the expression:
* :const:`re.A` (ASCII-only matching)
* :const:`re.I` (ignore case)
* :const:`re.L` (locale dependent)
* :const:`re.M` (multi-line)
* :const:`re.S` (dot matches all)
* :const:`re.U` (Unicode matching)
* :const:`re.X` (verbose)
(The flags are described in :ref:`contents-of-module-re`.)
The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
@ -366,7 +377,7 @@ The special characters are:
when one of them appears in an inline group, it overrides the matching mode
in the enclosing group. In Unicode patterns ``(?a:...)`` switches to
ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
(default). In byte pattern ``(?L:...)`` switches to locale depending
(default). In bytes patterns ``(?L:...)`` switches to locale dependent
matching, and ``(?a:...)`` switches to ASCII-only matching (default).
This override is only in effect for the narrow inline group, and the
original matching mode is restored outside of the group.
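A small sketch of the scoped override described above, assuming a Unicode (str) pattern. Here only the first ``\w`` is restricted to ASCII; the second one keeps the default Unicode behaviour. The sample strings are invented.

```python
import re

# (?a:...) applies ASCII-only matching to the group and nothing else.
pattern = re.compile(r'(?a:\w)\w')

# 'a' satisfies the ASCII-only \w, and U+00E9 (e with acute) satisfies the Unicode \w.
ascii_then_unicode = pattern.fullmatch('a\u00e9')

# In the other order, U+00E9 fails the ASCII-only \w inside the group.
unicode_then_ascii = pattern.fullmatch('\u00e9a')
```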
@ -528,47 +539,61 @@ character ``'$'``.
``\b``
Matches the empty string, but only at the beginning or end of a word.
A word is defined as a sequence of word characters. Note that formally,
``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
(or vice versa), or between ``\w`` and the beginning/end of the string.
This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
A word is defined as a sequence of word characters.
Note that formally, ``\b`` is defined as the boundary
between a ``\w`` and a ``\W`` character (or vice versa),
or between ``\w`` and the beginning or end of the string.
This means that ``r'\bat\b'`` matches ``'at'``, ``'at.'``, ``'(at)'``,
and ``'as at ay'`` but not ``'attempt'`` or ``'atlas'``.
By default Unicode alphanumerics are the ones used in Unicode patterns, but
this can be changed by using the :const:`ASCII` flag. Word boundaries are
determined by the current locale if the :const:`LOCALE` flag is used.
Inside a character range, ``\b`` represents the backspace character, for
compatibility with Python's string literals.
The default word characters in Unicode (str) patterns
are Unicode alphanumerics and the underscore,
but this can be changed by using the :py:const:`~re.ASCII` flag.
Word boundaries are determined by the current locale
if the :py:const:`~re.LOCALE` flag is used.
.. note::
Inside a character range, ``\b`` represents the backspace character,
for compatibility with Python's string literals.
.. index:: single: \B; in regular expressions
``\B``
Matches the empty string, but only when it is *not* at the beginning or end
of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
``\B`` is just the opposite of ``\b``, so word characters in Unicode
patterns are Unicode alphanumerics or the underscore, although this can
be changed by using the :const:`ASCII` flag. Word boundaries are
determined by the current locale if the :const:`LOCALE` flag is used.
Matches the empty string,
but only when it is *not* at the beginning or end of a word.
This means that ``r'at\B'`` matches ``'athens'``, ``'atom'``,
``'attorney'``, but not ``'at'``, ``'at.'``, or ``'at!'``.
``\B`` is the opposite of ``\b``,
so word characters in Unicode (str) patterns
are Unicode alphanumerics or the underscore,
although this can be changed by using the :py:const:`~re.ASCII` flag.
Word boundaries are determined by the current locale
if the :py:const:`~re.LOCALE` flag is used.
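The ``\b`` and ``\B`` examples quoted above can be verified directly. The last two lines are an extra, assumed example showing how the ASCII flag changes what counts as a word character.

```python
import re

boundary = re.compile(r'\bat\b')
boundary_hits = [s for s in ['at', 'at.', '(at)', 'as at ay', 'attempt', 'atlas']
                 if boundary.search(s)]

non_boundary = re.compile(r'at\B')
non_boundary_hits = [s for s in ['athens', 'atom', 'attorney', 'at', 'at.', 'at!']
                     if non_boundary.search(s)]

# U+00E9 is a word character by default, so there is no boundary before 'at'.
# With re.ASCII it is not a word character, and a boundary appears.
default_match = re.search(r'\bat\b', '\u00e9at')
ascii_match = re.search(r'\bat\b', '\u00e9at', re.ASCII)
```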
.. index:: single: \d; in regular expressions
``\d``
For Unicode (str) patterns:
Matches any Unicode decimal digit (that is, any character in
Unicode character category [Nd]). This includes ``[0-9]``, and
also many other digit characters. If the :const:`ASCII` flag is
used only ``[0-9]`` is matched.
Matches any Unicode decimal digit
(that is, any character in Unicode character category `[Nd]`__).
This includes ``[0-9]``, and also many other digit characters.
Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.
__ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153
For 8-bit (bytes) patterns:
Matches any decimal digit; this is equivalent to ``[0-9]``.
Matches any decimal digit in the ASCII character set;
this is equivalent to ``[0-9]``.
.. index:: single: \D; in regular expressions
``\D``
Matches any character which is not a decimal digit. This is
the opposite of ``\d``. If the :const:`ASCII` flag is used this
becomes the equivalent of ``[^0-9]``.
Matches any character which is not a decimal digit.
This is the opposite of ``\d``.
Matches ``[^0-9]`` if the :py:const:`~re.ASCII` flag is used.
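To make the ``\d`` and ``\D`` rules concrete, the following sketch uses one ASCII digit and two non-ASCII decimal digits. The sample text is an assumption chosen for illustration.

```python
import re

# '1', ARABIC-INDIC DIGIT TWO (U+0662), DEVANAGARI DIGIT THREE (U+0969)
text = '1\u0662\u0969'

unicode_digits = re.findall(r'\d', text)           # all three are category Nd
ascii_digits = re.findall(r'\d', text, re.ASCII)   # only [0-9]

# A bytes pattern only ever sees ASCII digits, even in UTF-8 encoded text.
bytes_digits = re.findall(rb'\d', text.encode('utf-8'))

# With re.ASCII, \D behaves like [^0-9].
ascii_non_digits = re.findall(r'\D', text + 'x', re.ASCII)
```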
.. index:: single: \s; in regular expressions
@ -577,8 +602,9 @@ character ``'$'``.
Matches Unicode whitespace characters (which includes
``[ \t\n\r\f\v]``, and also many other characters, for example the
non-breaking spaces mandated by typography rules in many
languages). If the :const:`ASCII` flag is used, only
``[ \t\n\r\f\v]`` is matched.
languages).
Matches ``[ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
For 8-bit (bytes) patterns:
Matches characters considered whitespace in the ASCII character set;
@ -588,30 +614,39 @@ character ``'$'``.
``\S``
Matches any character which is not a whitespace character. This is
the opposite of ``\s``. If the :const:`ASCII` flag is used this
becomes the equivalent of ``[^ \t\n\r\f\v]``.
the opposite of ``\s``.
Matches ``[^ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
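A brief sketch of the ``\s`` and ``\S`` behaviour, using the no-break space (U+00A0) as an example of Unicode whitespace outside the ASCII set. The variable names are illustrative only.

```python
import re

nbsp = '\u00a0'  # NO-BREAK SPACE

unicode_ws = re.fullmatch(r'\s', nbsp)            # Unicode whitespace
ascii_ws = re.fullmatch(r'\s', nbsp, re.ASCII)    # not in [ \t\n\r\f\v]
ascii_non_ws = re.fullmatch(r'\S', nbsp, re.ASCII)

# In a bytes pattern the byte 0xA0 is not ASCII whitespace either.
bytes_ws = re.fullmatch(rb'\s', b'\xa0')
```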
.. index:: single: \w; in regular expressions
``\w``
For Unicode (str) patterns:
Matches Unicode word characters; this includes alphanumeric characters (as defined by :meth:`str.isalnum`)
Matches Unicode word characters;
this includes all Unicode alphanumeric characters
(as defined by :py:meth:`str.isalnum`),
as well as the underscore (``_``).
If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched.
Matches ``[a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
For 8-bit (bytes) patterns:
Matches characters considered alphanumeric in the ASCII character set;
this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
used, matches characters considered alphanumeric in the current locale
and the underscore.
this is equivalent to ``[a-zA-Z0-9_]``.
If the :py:const:`~re.LOCALE` flag is used,
matches characters considered alphanumeric in the current locale and the underscore.
.. index:: single: \W; in regular expressions
``\W``
Matches any character which is not a word character. This is
the opposite of ``\w``. If the :const:`ASCII` flag is used this
becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
used, matches characters which are neither alphanumeric in the current locale
Matches any character which is not a word character.
This is the opposite of ``\w``.
By default, matches non-underscore (``_``) characters
for which :py:meth:`str.isalnum` returns ``False``.
Matches ``[^a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
If the :py:const:`~re.LOCALE` flag is used,
matches characters which are neither alphanumeric in the current locale
nor the underscore.
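The difference between Unicode and ASCII-only word characters can be seen with a short, assumed sample containing accented letters:

```python
import re

text = 'na\u00efve caf\u00e9_1'  # 'naive' and 'cafe_1' with accented letters

unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'\w+', text, re.ASCII)

# With re.ASCII, \W behaves like [^a-zA-Z0-9_].
ascii_non_words = re.findall(r'\W', text, re.ASCII)

# A bytes pattern uses the ASCII definition unless re.LOCALE is given.
bytes_words = re.findall(rb'\w+', b'cafe_1 ok')
```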
.. index:: single: \Z; in regular expressions
@ -643,9 +678,11 @@ accepted by the regular expression parser::
(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)
``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
patterns. In bytes patterns they are errors. Unknown escapes of ASCII
letters are reserved for future use and treated as errors.
``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are
only recognized in Unicode (str) patterns.
In bytes patterns they are errors.
Unknown escapes of ASCII letters are reserved
for future use and treated as errors.
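The escape rules above can be sketched as follows. The helper ``compiles`` is hypothetical and exists only to turn a compile error into a boolean; ``\q`` stands in for any unknown ASCII-letter escape.

```python
import re

def compiles(pattern):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# \u and \N{...} are recognised in Unicode (str) patterns.
by_codepoint = re.fullmatch(r'\u00e9', '\u00e9')
by_name = re.fullmatch(r'\N{LATIN SMALL LETTER E WITH ACUTE}', '\u00e9')

# The same escape in a bytes pattern is an error, as is an unknown escape.
bytes_u_ok = compiles(rb'\u00e9')
unknown_ok = compiles(r'\q')
```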
Octal escapes are included in a limited form. If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
@ -693,30 +730,37 @@ Flags
Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
perform ASCII-only matching instead of full Unicode matching. This is only
meaningful for Unicode patterns, and is ignored for byte patterns.
meaningful for Unicode (str) patterns, and is ignored for bytes patterns.
Corresponds to the inline flag ``(?a)``.
Note that for backward compatibility, the :const:`re.U` flag still
exists (as well as its synonym :const:`re.UNICODE` and its embedded
counterpart ``(?u)``), but these are redundant in Python 3 since
matches are Unicode by default for strings (and Unicode matching
isn't allowed for bytes).
.. note::
The :py:const:`~re.U` flag still exists for backward compatibility,
but is redundant in Python 3 since
matches are Unicode by default for ``str`` patterns,
and Unicode matching isn't allowed for bytes patterns.
:py:const:`~re.UNICODE` and the inline flag ``(?u)`` are similarly redundant.
.. data:: DEBUG
Display debug information about compiled expression.
No corresponding inline flag.
.. data:: I
IGNORECASE
Perform case-insensitive matching; expressions like ``[A-Z]`` will also
match lowercase letters. Full Unicode matching (such as ``Ü`` matching
``ü``) also works unless the :const:`re.ASCII` flag is used to disable
non-ASCII matches. The current locale does not change the effect of this
flag unless the :const:`re.LOCALE` flag is also used.
Perform case-insensitive matching;
expressions like ``[A-Z]`` will also match lowercase letters.
Full Unicode matching (such as ``Ü`` matching ``ü``)
also works unless the :py:const:`~re.ASCII` flag
is used to disable non-ASCII matches.
The current locale does not change the effect of this flag
unless the :py:const:`~re.LOCALE` flag is also used.
Corresponds to the inline flag ``(?i)``.
Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
@ -724,29 +768,35 @@ Flags
letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
If the :const:`ASCII` flag is used, only letters 'a' to 'z'
If the :py:const:`~re.ASCII` flag is used, only letters 'a' to 'z'
and 'A' to 'Z' are matched.
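A minimal sketch of how the ASCII flag interacts with case-insensitive matching, using the umlaut and Kelvin sign cases mentioned above. The variable names are illustrative.

```python
import re

# Full Unicode case folding: U+00FC (small u with diaeresis) matches U+00DC.
umlaut = re.fullmatch('\u00fc', '\u00dc', re.IGNORECASE)
umlaut_ascii = re.fullmatch('\u00fc', '\u00dc', re.IGNORECASE | re.ASCII)

# [a-z] with IGNORECASE also matches the Kelvin sign (U+212A) in Unicode mode.
kelvin = re.fullmatch('[a-z]', '\u212a', re.IGNORECASE)
kelvin_ascii = re.fullmatch('[a-z]', '\u212a', re.IGNORECASE | re.ASCII)
```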
.. data:: L
LOCALE
Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
dependent on the current locale. This flag can be used only with bytes
patterns. The use of this flag is discouraged as the locale mechanism
is very unreliable, it only handles one "culture" at a time, and it only
works with 8-bit locales. Unicode matching is already enabled by default
in Python 3 for Unicode (str) patterns, and it is able to handle different
locales/languages.
dependent on the current locale.
This flag can be used only with bytes patterns.
Corresponds to the inline flag ``(?L)``.
.. warning::
This flag is discouraged; consider Unicode matching instead.
The locale mechanism is very unreliable
as it only handles one "culture" at a time
and only works with 8-bit locales.
Unicode matching is enabled by default for Unicode (str) patterns
and it is able to handle different locales and languages.
.. versionchanged:: 3.6
:const:`re.LOCALE` can be used only with bytes patterns and is
not compatible with :const:`re.ASCII`.
:py:const:`~re.LOCALE` can be used only with bytes patterns
and is not compatible with :py:const:`~re.ASCII`.
.. versionchanged:: 3.7
Compiled regular expression objects with the :const:`re.LOCALE` flag no
longer depend on the locale at compile time. Only the locale at
matching time affects the result of matching.
Compiled regular expression objects with the :py:const:`~re.LOCALE` flag
no longer depend on the locale at compile time.
Only the locale at matching time affects the result of matching.
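The restrictions on the locale flag can be probed with a sketch like the one below. The helper ``flag_error`` is hypothetical. It assumes, as current CPython releases do, that an invalid flag combination is reported as ``ValueError`` at compile time.

```python
import re

def flag_error(pattern, flags):
    """Return the error message if the flags are rejected, else None."""
    try:
        re.compile(pattern, flags)
    except ValueError as exc:
        return str(exc)
    return None

# LOCALE is rejected for str patterns and cannot be combined with ASCII.
str_with_locale = flag_error(r'\w', re.LOCALE)
ascii_with_locale = flag_error(rb'\w', re.LOCALE | re.ASCII)

# A bytes pattern with LOCALE alone compiles.
bytes_with_locale = flag_error(rb'\w', re.LOCALE)
```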
.. data:: M
@ -758,6 +808,7 @@ Flags
end of each line (immediately preceding each newline). By default, ``'^'``
matches only at the beginning of the string, and ``'$'`` only at the end of the
string and immediately before the newline (if any) at the end of the string.
Corresponds to the inline flag ``(?m)``.
.. data:: NOFLAG
@ -777,19 +828,19 @@ Flags
Make the ``'.'`` special character match any character at all, including a
newline; without this flag, ``'.'`` will match anything *except* a newline.
Corresponds to the inline flag ``(?s)``.
.. data:: U
UNICODE
In Python 2, this flag made :ref:`special sequences <re-special-sequences>`
include Unicode characters in matches. Since Python 3, Unicode characters
are matched by default.
In Python 3, Unicode characters are matched by default
for ``str`` patterns.
This flag is therefore redundant with **no effect**
and is only kept for backward compatibility.
See :const:`A` for restricting matching on ASCII characters instead.
This flag is only kept for backward compatibility.
See :py:const:`~re.ASCII` to restrict matching to ASCII characters instead.
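The redundancy of the Unicode flag for ``str`` patterns can be checked by comparing compiled patterns. This is a sketch with made-up names; the bytes case assumes the flag is rejected with ``ValueError``, which is how current CPython behaves.

```python
import re

plain = re.compile(r'\w+')
explicit = re.compile(r'\w+', re.UNICODE)

# The flag is already implied for str patterns, so passing it changes nothing.
same_flags = plain.flags == explicit.flags
implicit_unicode = bool(plain.flags & re.UNICODE)

# Unicode matching is not available for bytes patterns.
try:
    re.compile(rb'\w+', re.UNICODE)
    bytes_unicode_ok = True
except ValueError:
    bytes_unicode_ok = False
```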
.. data:: X
VERBOSE
@ -913,6 +964,8 @@ Functions
Empty matches for the pattern split the string only when not adjacent
to a previous empty match.
.. code:: pycon
>>> re.split(r'\b', 'Words, words, words.')
['', 'Words', ', ', 'words', ', ', 'words', '.']
>>> re.split(r'\W*', '...words...')
@ -1229,7 +1282,7 @@ Regular Expression Objects
The regex matching flags. This is a combination of the flags given to
:func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
flags such as :data:`UNICODE` if the pattern is a Unicode string.
flags such as :py:const:`~re.UNICODE` if the pattern is a Unicode string.
.. attribute:: Pattern.groups