mirror of https://github.com/python/cpython.git
[3.11] GH-107678: Improve Unicode handling clarity in ``library/re.rst`` (GH-107679) (#113966)
GH-107678: Improve Unicode handling clarity in ``library/re.rst`` (GH-107679)
(cherry picked from commit c9b8a22f34)
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
parent 71288ed8b0
commit 535b1dedb3
@@ -17,7 +17,7 @@ those found in Perl.
 Both patterns and strings to be searched can be Unicode strings (:class:`str`)
 as well as 8-bit strings (:class:`bytes`).
 However, Unicode strings and 8-bit strings cannot be mixed:
-that is, you cannot match a Unicode string with a byte pattern or
+that is, you cannot match a Unicode string with a bytes pattern or
 vice-versa; similarly, when asking for a substitution, the replacement
 string must be of the same type as both the pattern and the search string.
 
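The rule in this hunk (``str`` and ``bytes`` cannot be mixed) can be checked with a small sketch. The sample text and variable names are illustrative only and are not part of the commit.

```python
import re

text = "spam and eggs"
data = text.encode("ascii")

# Pattern and subject share a type: both calls succeed.
str_match = re.search(r"eggs", text)
bytes_match = re.search(rb"eggs", data)

# A bytes pattern applied to a str subject raises TypeError.
try:
    re.search(rb"eggs", text)
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

# The replacement must have the same type as the pattern and the subject.
fixed = re.sub(rb"eggs", b"ham", data)
```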
@@ -257,8 +257,7 @@ The special characters are:
 .. index:: single: \ (backslash); in regular expressions
 
 * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
-inside a set, although the characters they match depends on whether
-:const:`ASCII` or :const:`LOCALE` mode is in force.
+inside a set, although the characters they match depend on the flags_ used.
 
 .. index:: single: ^ (caret); in regular expressions
 
@@ -326,18 +325,24 @@ The special characters are:
 currently supported extensions.
 
 ``(?aiLmsux)``
-(One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
-``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
-letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
-:const:`re.I` (ignore case), :const:`re.L` (locale dependent),
-:const:`re.M` (multi-line), :const:`re.S` (dot matches all),
-:const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
-for the entire regular expression.
+(One or more letters from the set
+``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``.)
+The group matches the empty string;
+the letters set the corresponding flags for the entire regular expression:
+
+* :const:`re.A` (ASCII-only matching)
+* :const:`re.I` (ignore case)
+* :const:`re.L` (locale dependent)
+* :const:`re.M` (multi-line)
+* :const:`re.S` (dot matches all)
+* :const:`re.U` (Unicode matching)
+* :const:`re.X` (verbose)
+
 (The flags are described in :ref:`contents-of-module-re`.)
 This is useful if you wish to include the flags as part of the
 regular expression, instead of passing a *flag* argument to the
-:func:`re.compile` function. Flags should be used first in the
-expression string.
+:func:`re.compile` function.
+Flags should be used first in the expression string.
 
 .. versionchanged:: 3.11
 This construction can only be used at the start of the expression.
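As a rough illustration of the ``(?aiLmsux)`` form described in this hunk, the sketch below compares inline flags placed at the start of a pattern with the same flags passed to ``re.compile``. The pattern and sample text are made up for the example.

```python
import re

# Inline flags at the very start of the pattern...
inline = re.compile(r"(?im)^spam$")
# ...versus the same flags passed as an argument.
explicit = re.compile(r"^spam$", re.IGNORECASE | re.MULTILINE)

text = "eggs\nSPAM\n"
inline_hits = inline.findall(text)
explicit_hits = explicit.findall(text)
```

Both patterns should find the same line, since the inline group applies to the entire expression.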
@@ -351,14 +356,20 @@ The special characters are:
 pattern.
 
 ``(?aiLmsux-imsx:...)``
-(Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
-``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
+(Zero or more letters from the set
+``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``,
+optionally followed by ``'-'`` followed by
 one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
-The letters set or remove the corresponding flags:
-:const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
-:const:`re.L` (locale dependent), :const:`re.M` (multi-line),
-:const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
-and :const:`re.X` (verbose), for the part of the expression.
+The letters set or remove the corresponding flags for the part of the expression:
+
+* :const:`re.A` (ASCII-only matching)
+* :const:`re.I` (ignore case)
+* :const:`re.L` (locale dependent)
+* :const:`re.M` (multi-line)
+* :const:`re.S` (dot matches all)
+* :const:`re.U` (Unicode matching)
+* :const:`re.X` (verbose)
+
 (The flags are described in :ref:`contents-of-module-re`.)
 
 The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
@@ -366,7 +377,7 @@ The special characters are:
 when one of them appears in an inline group, it overrides the matching mode
 in the enclosing group. In Unicode patterns ``(?a:...)`` switches to
 ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
-(default). In byte pattern ``(?L:...)`` switches to locale depending
+(default). In bytes patterns ``(?L:...)`` switches to locale dependent
 matching, and ``(?a:...)`` switches to ASCII-only matching (default).
 This override is only in effect for the narrow inline group, and the
 original matching mode is restored outside of the group.
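The scoped override described above can be sketched as follows: ``(?a:...)`` restricts ``\w`` to ASCII only inside the group, and Unicode matching resumes after it. The pattern and inputs are hypothetical examples, not taken from the commit.

```python
import re

# \w is ASCII-only inside the group and Unicode-aware outside it.
pattern = re.compile(r"(?a:\w+)-\w+")

# ASCII word, then a Unicode word: both halves are accepted.
ascii_first = pattern.fullmatch("abc-caf\u00e9")
# Unicode word first: the ASCII-only group cannot consume the accented letter.
unicode_first = pattern.fullmatch("caf\u00e9-abc")
```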
@@ -528,47 +539,61 @@ character ``'$'``.
 
 ``\b``
 Matches the empty string, but only at the beginning or end of a word.
-A word is defined as a sequence of word characters. Note that formally,
-``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
-(or vice versa), or between ``\w`` and the beginning/end of the string.
-This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
-``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
+A word is defined as a sequence of word characters.
+Note that formally, ``\b`` is defined as the boundary
+between a ``\w`` and a ``\W`` character (or vice versa),
+or between ``\w`` and the beginning or end of the string.
+This means that ``r'\bat\b'`` matches ``'at'``, ``'at.'``, ``'(at)'``,
+and ``'as at ay'`` but not ``'attempt'`` or ``'atlas'``.
 
-By default Unicode alphanumerics are the ones used in Unicode patterns, but
-this can be changed by using the :const:`ASCII` flag. Word boundaries are
-determined by the current locale if the :const:`LOCALE` flag is used.
-Inside a character range, ``\b`` represents the backspace character, for
-compatibility with Python's string literals.
+The default word characters in Unicode (str) patterns
+are Unicode alphanumerics and the underscore,
+but this can be changed by using the :py:const:`~re.ASCII` flag.
+Word boundaries are determined by the current locale
+if the :py:const:`~re.LOCALE` flag is used.
+
+.. note::
+
+Inside a character range, ``\b`` represents the backspace character,
+for compatibility with Python's string literals.
 
 .. index:: single: \B; in regular expressions
 
 ``\B``
-Matches the empty string, but only when it is *not* at the beginning or end
-of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
-``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
-``\B`` is just the opposite of ``\b``, so word characters in Unicode
-patterns are Unicode alphanumerics or the underscore, although this can
-be changed by using the :const:`ASCII` flag. Word boundaries are
-determined by the current locale if the :const:`LOCALE` flag is used.
+Matches the empty string,
+but only when it is *not* at the beginning or end of a word.
+This means that ``r'at\B'`` matches ``'athens'``, ``'atom'``,
+``'attorney'``, but not ``'at'``, ``'at.'``, or ``'at!'``.
+``\B`` is the opposite of ``\b``,
+so word characters in Unicode (str) patterns
+are Unicode alphanumerics or the underscore,
+although this can be changed by using the :py:const:`~re.ASCII` flag.
+Word boundaries are determined by the current locale
+if the :py:const:`~re.LOCALE` flag is used.
 
 .. index:: single: \d; in regular expressions
 
 ``\d``
 For Unicode (str) patterns:
-Matches any Unicode decimal digit (that is, any character in
-Unicode character category [Nd]). This includes ``[0-9]``, and
-also many other digit characters. If the :const:`ASCII` flag is
-used only ``[0-9]`` is matched.
+Matches any Unicode decimal digit
+(that is, any character in Unicode character category `[Nd]`__).
+This includes ``[0-9]``, and also many other digit characters.
+
+Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.
+
+__ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153
 
 For 8-bit (bytes) patterns:
-Matches any decimal digit; this is equivalent to ``[0-9]``.
+Matches any decimal digit in the ASCII character set;
+this is equivalent to ``[0-9]``.
 
 .. index:: single: \D; in regular expressions
 
 ``\D``
-Matches any character which is not a decimal digit. This is
-the opposite of ``\d``. If the :const:`ASCII` flag is used this
-becomes the equivalent of ``[^0-9]``.
+Matches any character which is not a decimal digit.
+This is the opposite of ``\d``.
+
+Matches ``[^0-9]`` if the :py:const:`~re.ASCII` flag is used.
 
 .. index:: single: \s; in regular expressions
 
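The ``\b`` and ``\d`` wording in this hunk can be exercised with a short sketch. The word list reuses the strings quoted in the new text; the Arabic-Indic digit string is an added example chosen because those characters are in category Nd.

```python
import re

# \b: 'at' must stand alone as a word.
word_at = re.compile(r"\bat\b")
candidates = ["at", "at.", "(at)", "as at ay", "attempt", "atlas"]
hits = [s for s in candidates if word_at.search(s)]

# Inside a character set, \b means the backspace character (U+0008).
backspace = re.search(r"[\b]", "a\x08c")

# \d: Unicode decimal digits by default, [0-9] only with re.ASCII.
arabic_indic = "\u0661\u0662\u0663"
unicode_digits = re.fullmatch(r"\d+", arabic_indic)
ascii_digits = re.fullmatch(r"\d+", arabic_indic, re.ASCII)
bytes_digits = re.fullmatch(rb"\d+", b"123")
```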
@@ -577,8 +602,9 @@ character ``'$'``.
 Matches Unicode whitespace characters (which includes
 ``[ \t\n\r\f\v]``, and also many other characters, for example the
 non-breaking spaces mandated by typography rules in many
-languages). If the :const:`ASCII` flag is used, only
-``[ \t\n\r\f\v]`` is matched.
+languages).
+
+Matches ``[ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
 
 For 8-bit (bytes) patterns:
 Matches characters considered whitespace in the ASCII character set;
@@ -588,30 +614,39 @@ character ``'$'``.
 
 ``\S``
 Matches any character which is not a whitespace character. This is
-the opposite of ``\s``. If the :const:`ASCII` flag is used this
-becomes the equivalent of ``[^ \t\n\r\f\v]``.
+the opposite of ``\s``.
+
+Matches ``[^ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
 
 .. index:: single: \w; in regular expressions
 
 ``\w``
 For Unicode (str) patterns:
-Matches Unicode word characters; this includes alphanumeric characters (as defined by :meth:`str.isalnum`)
+Matches Unicode word characters;
+this includes all Unicode alphanumeric characters
+(as defined by :py:meth:`str.isalnum`),
 as well as the underscore (``_``).
-If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched.
+
+Matches ``[a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
 
 For 8-bit (bytes) patterns:
 Matches characters considered alphanumeric in the ASCII character set;
-this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
-used, matches characters considered alphanumeric in the current locale
-and the underscore.
+this is equivalent to ``[a-zA-Z0-9_]``.
+If the :py:const:`~re.LOCALE` flag is used,
+matches characters considered alphanumeric in the current locale and the underscore.
 
 .. index:: single: \W; in regular expressions
 
 ``\W``
-Matches any character which is not a word character. This is
-the opposite of ``\w``. If the :const:`ASCII` flag is used this
-becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
-used, matches characters which are neither alphanumeric in the current locale
+Matches any character which is not a word character.
+This is the opposite of ``\w``.
+By default, matches non-underscore (``_``) characters
+for which :py:meth:`str.isalnum` returns ``False``.
+
+Matches ``[^a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
+
+If the :py:const:`~re.LOCALE` flag is used,
+matches characters which are neither alphanumeric in the current locale
 nor the underscore.
 
 .. index:: single: \Z; in regular expressions
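A minimal sketch of the ``\w`` and ``\W`` behaviour documented in this hunk, contrasting the default Unicode mode with ``re.ASCII``. The sample strings are invented for the example.

```python
import re

sample = "na\u00efve caf\u00e9_1"  # "naïve café_1"

# Default (Unicode) mode: accented letters count as word characters.
unicode_words = re.findall(r"\w+", sample)
# re.ASCII: only [a-zA-Z0-9_] counts, so accented letters split the words.
ascii_words = re.findall(r"\w+", sample, re.ASCII)

# \W is the complement of \w under whichever mode is active.
mixed = "a_b \u00e9!"
unicode_non_words = re.findall(r"\W", mixed)
ascii_non_words = re.findall(r"\W", mixed, re.ASCII)
```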
@@ -643,9 +678,11 @@ accepted by the regular expression parser::
 (Note that ``\b`` is used to represent word boundaries, and means "backspace"
 only inside character classes.)
 
-``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
-patterns. In bytes patterns they are errors. Unknown escapes of ASCII
-letters are reserved for future use and treated as errors.
+``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are
+only recognized in Unicode (str) patterns.
+In bytes patterns they are errors.
+Unknown escapes of ASCII letters are reserved
+for future use and treated as errors.
 
 Octal escapes are included in a limited form. If the first digit is a 0, or if
 there are three octal digits, it is considered an octal escape. Otherwise, it is
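The escape rules restated in this hunk can be probed with a small sketch. The helper ``compiles`` is a hypothetical name introduced only for this example; ``\q`` stands in for an arbitrary unknown ASCII-letter escape.

```python
import re

def compiles(pattern):
    """Return True if re.compile accepts the pattern, False on re.error."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# \u is recognised in str patterns...
str_escape = re.fullmatch(r"caf\u00e9", "caf\u00e9")
# ...but is an error in bytes patterns.
bytes_escape_ok = compiles(rb"caf\u00e9")
# Unknown escapes of ASCII letters are rejected as well.
unknown_escape_ok = compiles(r"\q")
```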
@@ -693,30 +730,37 @@ Flags
 
 Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
 perform ASCII-only matching instead of full Unicode matching. This is only
-meaningful for Unicode patterns, and is ignored for byte patterns.
+meaningful for Unicode (str) patterns, and is ignored for bytes patterns.
+
 Corresponds to the inline flag ``(?a)``.
 
-Note that for backward compatibility, the :const:`re.U` flag still
-exists (as well as its synonym :const:`re.UNICODE` and its embedded
-counterpart ``(?u)``), but these are redundant in Python 3 since
-matches are Unicode by default for strings (and Unicode matching
-isn't allowed for bytes).
+.. note::
+
+The :py:const:`~re.U` flag still exists for backward compatibility,
+but is redundant in Python 3 since
+matches are Unicode by default for ``str`` patterns,
+and Unicode matching isn't allowed for bytes patterns.
+:py:const:`~re.UNICODE` and the inline flag ``(?u)`` are similarly redundant.
 
 
 .. data:: DEBUG
 
 Display debug information about compiled expression.
+
 No corresponding inline flag.
 
 
 .. data:: I
 IGNORECASE
 
-Perform case-insensitive matching; expressions like ``[A-Z]`` will also
-match lowercase letters. Full Unicode matching (such as ``Ü`` matching
-``ü``) also works unless the :const:`re.ASCII` flag is used to disable
-non-ASCII matches. The current locale does not change the effect of this
-flag unless the :const:`re.LOCALE` flag is also used.
+Perform case-insensitive matching;
+expressions like ``[A-Z]`` will also match lowercase letters.
+Full Unicode matching (such as ``Ü`` matching ``ü``)
+also works unless the :py:const:`~re.ASCII` flag
+is used to disable non-ASCII matches.
+The current locale does not change the effect of this flag
+unless the :py:const:`~re.LOCALE` flag is also used.
+
 Corresponds to the inline flag ``(?i)``.
 
 Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
@@ -724,29 +768,35 @@ Flags
 letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
 letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
 'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
-If the :const:`ASCII` flag is used, only letters 'a' to 'z'
+If the :py:const:`~re.ASCII` flag is used, only letters 'a' to 'z'
 and 'A' to 'Z' are matched.
 
 .. data:: L
 LOCALE
 
 Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
-dependent on the current locale. This flag can be used only with bytes
-patterns. The use of this flag is discouraged as the locale mechanism
-is very unreliable, it only handles one "culture" at a time, and it only
-works with 8-bit locales. Unicode matching is already enabled by default
-in Python 3 for Unicode (str) patterns, and it is able to handle different
-locales/languages.
+dependent on the current locale.
+This flag can be used only with bytes patterns.
+
 Corresponds to the inline flag ``(?L)``.
+
+.. warning::
+
+This flag is discouraged; consider Unicode matching instead.
+The locale mechanism is very unreliable
+as it only handles one "culture" at a time
+and only works with 8-bit locales.
+Unicode matching is enabled by default for Unicode (str) patterns
+and it is able to handle different locales and languages.
 
 .. versionchanged:: 3.6
-:const:`re.LOCALE` can be used only with bytes patterns and is
-not compatible with :const:`re.ASCII`.
+:py:const:`~re.LOCALE` can be used only with bytes patterns
+and is not compatible with :py:const:`~re.ASCII`.
 
 .. versionchanged:: 3.7
-Compiled regular expression objects with the :const:`re.LOCALE` flag no
-longer depend on the locale at compile time. Only the locale at
-matching time affects the result of matching.
+Compiled regular expression objects with the :py:const:`~re.LOCALE` flag
+no longer depend on the locale at compile time.
+Only the locale at matching time affects the result of matching.
 
 
 .. data:: M
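Two statements in this hunk lend themselves to a quick check: the restrictions on ``re.LOCALE`` noted under ``versionchanged:: 3.6``, and the Kelvin sign matching ``[a-z]`` under ``re.IGNORECASE``. The helper ``compile_error`` is a hypothetical name used only in this sketch.

```python
import re

def compile_error(pattern, flags):
    """Return the ValueError message raised by re.compile, or None."""
    try:
        re.compile(pattern, flags)
    except ValueError as exc:
        return str(exc)
    return None

# LOCALE is rejected for str patterns and cannot be combined with ASCII.
locale_on_str = compile_error("spam", re.LOCALE)
locale_with_ascii = compile_error(b"spam", re.LOCALE | re.ASCII)
locale_on_bytes = compile_error(b"spam", re.LOCALE)

# With IGNORECASE, [a-z] also matches the Kelvin sign unless ASCII is set.
kelvin = "\u212a"
kelvin_unicode = re.fullmatch(r"[a-z]", kelvin, re.IGNORECASE)
kelvin_ascii = re.fullmatch(r"[a-z]", kelvin, re.IGNORECASE | re.ASCII)
```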
@@ -758,6 +808,7 @@ Flags
 end of each line (immediately preceding each newline). By default, ``'^'``
 matches only at the beginning of the string, and ``'$'`` only at the end of the
 string and immediately before the newline (if any) at the end of the string.
+
 Corresponds to the inline flag ``(?m)``.
 
 .. data:: NOFLAG
@@ -777,19 +828,19 @@ Flags
 
 Make the ``'.'`` special character match any character at all, including a
 newline; without this flag, ``'.'`` will match anything *except* a newline.
+
 Corresponds to the inline flag ``(?s)``.
 
 
 .. data:: U
 UNICODE
 
-In Python 2, this flag made :ref:`special sequences <re-special-sequences>`
-include Unicode characters in matches. Since Python 3, Unicode characters
-are matched by default.
+In Python 3, Unicode characters are matched by default
+for ``str`` patterns.
+This flag is therefore redundant with **no effect**
+and is only kept for backward compatibility.
 
-See :const:`A` for restricting matching on ASCII characters instead.
-
-This flag is only kept for backward compatibility.
+See :py:const:`~re.ASCII` to restrict matching to ASCII characters instead.
 
 .. data:: X
 VERBOSE
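The claim that ``re.UNICODE`` is redundant for ``str`` patterns can be sketched as below. The sample word is illustrative.

```python
import re

word = "na\u00efve"  # "naïve"

# Passing re.UNICODE changes nothing for a str pattern.
default = re.compile(r"\w+")
explicit = re.compile(r"\w+", re.UNICODE)
same_flags = default.flags == explicit.flags
same_result = default.findall(word) == explicit.findall(word)

# re.ASCII is the flag that actually alters the matching.
ascii_result = re.findall(r"\w+", word, re.ASCII)
```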
@@ -913,6 +964,8 @@ Functions
 Empty matches for the pattern split the string only when not adjacent
 to a previous empty match.
 
+.. code:: pycon
+
 >>> re.split(r'\b', 'Words, words, words.')
 ['', 'Words', ', ', 'words', ', ', 'words', '.']
 >>> re.split(r'\W*', '...words...')
@@ -1229,7 +1282,7 @@ Regular Expression Objects
 
 The regex matching flags. This is a combination of the flags given to
 :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
-flags such as :data:`UNICODE` if the pattern is a Unicode string.
+flags such as :py:const:`~re.UNICODE` if the pattern is a Unicode string.
 
 
 .. attribute:: Pattern.groups
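The description of ``Pattern.flags`` in this last hunk can be illustrated with a brief sketch showing the three sources it combines: flags passed to ``re.compile``, inline flags, and the implicit ``UNICODE`` flag for ``str`` patterns. The patterns are made up for the example.

```python
import re

# MULTILINE comes from the argument, IGNORECASE from the inline flag,
# and UNICODE is added implicitly because the pattern is a str.
str_pattern = re.compile(r"(?i)spam", re.MULTILINE)
has_multiline = bool(str_pattern.flags & re.MULTILINE)
has_ignorecase = bool(str_pattern.flags & re.IGNORECASE)
has_unicode = bool(str_pattern.flags & re.UNICODE)

# A bytes pattern does not get the implicit UNICODE flag.
bytes_pattern = re.compile(rb"spam")
bytes_has_unicode = bool(bytes_pattern.flags & re.UNICODE)
```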