Commit 6c4f6179 authored by Mark Summerfield

Revised all texts concerning the ASCII flag: (1) put Unicode case first
(since that's the default), (2) made all descriptions consistent, (3)
dropped mention of re.LOCALE in most places since it is not recommended.
parent 5ef6d18b
@@ -323,67 +323,78 @@ the second character. For example, ``\$`` matches the character ``'$'``.
 Matches only at the start of the string.
 ``\b``
-Matches the empty string, but only at the beginning or end of a word. A word is
-defined as a sequence of alphanumeric or underscore characters, so the end of a
-word is indicated by whitespace or a non-alphanumeric, non-underscore character.
-Note that ``\b`` is defined as the boundary between ``\w`` and ``\ W``, so the
-precise set of characters deemed to be alphanumeric depends on the values of the
-``ASCII`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
-the backspace character, for compatibility with Python's string literals.
+Matches the empty string, but only at the beginning or end of a word.
+A word is defined as a sequence of Unicode alphanumeric or underscore
+characters, so the end of a word is indicated by whitespace or a
+non-alphanumeric, non-underscore Unicode character. Note that
+formally, ``\b`` is defined as the boundary between a ``\w`` and a
+``\W`` character (or vice versa). By default Unicode alphanumerics
+are the ones used, but this can be changed by using the :const:`ASCII`
+flag. Inside a character range, ``\b`` represents the backspace
+character, for compatibility with Python's string literals.
 ``\B``
 Matches the empty string, but only when it is *not* at the beginning or end of a
-word. This is just the opposite of ``\b``, so is also subject to the settings
-of ``ASCII`` and ``LOCALE`` .
+word. This is just the opposite of ``\b``, so word characters are
+Unicode alphanumerics or the underscore, although this can be changed
+by using the :const:`ASCII` flag.
 ``\d``
 For Unicode (str) patterns:
-When the :const:`ASCII` flag is specified, matches any decimal digit; this
-is equivalent to the set ``[0-9]``. Otherwise, it will match whatever
-is classified as a digit in the Unicode character properties database
-(but this does include the standard ASCII digits and is thus a superset
-of [0-9]).
+Matches any Unicode digit (which includes ``[0-9]``, and also many
+other digit characters). If the :const:`ASCII` flag is used only
+``[0-9]`` is matched (but the flag affects the entire regular
+expression, so in such cases using an explicit ``[0-9]`` may be a
+better choice).
 For 8-bit (bytes) patterns:
-Matches any decimal digit; this is equivalent to the set ``[0-9]``.
+Matches any decimal digit; this is equivalent to ``[0-9]``.
 ``\D``
-Matches any character which is not a decimal digit. This is the
-opposite of ``\d`` and is therefore similarly subject to the settings of
-``ASCII`` and ``LOCALE``.
+Matches any character which is not a Unicode decimal digit. This is
+the opposite of ``\d``. If the :const:`ASCII` flag is used this
+becomes the equivalent of ``[^0-9]`` (but the flag affects the entire
+regular expression, so in such cases using an explicit ``[^0-9]`` may
+be a better choice).
 ``\s``
 For Unicode (str) patterns:
-When the :const:`ASCII` flag is specified, matches only ASCII whitespace
-characters; this is equivalent to the set ``[ \t\n\r\f\v]``. Otherwise,
-it will match this set whatever is classified as space in the Unicode
-character properties database (including for example the non-breaking
-spaces mandated by typography rules in many languages).
+Matches Unicode whitespace characters (which includes
+``[ \t\n\r\f\v]``, and also many other characters, for example the
+non-breaking spaces mandated by typography rules in many
+languages). If the :const:`ASCII` flag is used, only
+``[ \t\n\r\f\v]`` is matched (but the flag affects the entire
+regular expression, so in such cases using an explicit
+``[ \t\n\r\f\v]`` may be a better choice).
 For 8-bit (bytes) patterns:
 Matches characters considered whitespace in the ASCII character set;
-this is equivalent to the set ``[ \t\n\r\f\v]``.
+this is equivalent to ``[ \t\n\r\f\v]``.
 ``\S``
-Matches any character which is not a whitespace character. This is the
-opposite of ``\s`` and is therefore similarly subject to the settings of
-``ASCII`` and ``LOCALE``.
+Matches any character which is not a Unicode whitespace character. This is
+the opposite of ``\s``. If the :const:`ASCII` flag is used this
+becomes the equivalent of ``[^ \t\n\r\f\v]`` (but the flag affects the entire
+regular expression, so in such cases using an explicit ``[^ \t\n\r\f\v]`` may
+be a better choice).
 ``\w``
 For Unicode (str) patterns:
-When the :const:`ASCII` flag is specified, this is equivalent to the set
-``[a-zA-Z0-9_]``. Otherwise, it will match whatever is classified as
-alphanumeric in the Unicode character properties database (it will
-include most characters that can be part of a word in whatever language,
-as well as numbers and the underscore sign).
+Matches Unicode word characters; this includes most characters
+that can be part of a word in any language, as well as numbers and
+the underscore. If the :const:`ASCII` flag is used, only
+``[a-zA-Z0-9_]`` is matched (but the flag affects the entire
+regular expression, so in such cases using an explicit
+``[a-zA-Z0-9_]`` may be a better choice).
 For 8-bit (bytes) patterns:
 Matches characters considered alphanumeric in the ASCII character set;
-this is equivalent to the set ``[a-zA-Z0-9_]``. With :const:`LOCALE`,
-it will additionally match whatever characters are defined as
-alphanumeric for the current locale.
+this is equivalent to ``[a-zA-Z0-9_]``.
 ``\W``
-Matches any character which is not an alphanumeric character. This is the
-opposite of ``\w`` and is therefore similarly subject to the settings of
-``ASCII`` and ``LOCALE``.
+Matches any character which is not a Unicode word character. This is
+the opposite of ``\w``. If the :const:`ASCII` flag is used this
+becomes the equivalent of ``[^a-zA-Z0-9_]`` (but the flag affects the
+entire regular expression, so in such cases using an explicit
+``[^a-zA-Z0-9_]`` may be a better choice).
 ``\Z``
 Matches only at the end of the string.
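
Note on the hunk above: the revised wording for ``\b``, ``\d``, ``\s`` and ``\w`` can be checked with a small sketch like the following. It is not part of the commit. The sample strings and variable names are made up for illustration, and the behaviour shown assumes a current CPython 3 ``re`` module with str patterns.

```python
import re

# Illustrative sample text (an assumption, not taken from the commit):
# "café", the Arabic-Indic digits 3 and 4, and the ASCII digits "42".
text = "caf\u00e9 \u0663\u0664 42"

# Default for str patterns: \d and \w are Unicode-aware.
unicode_digits = re.findall(r"\d+", text)
unicode_words = re.findall(r"\w+", text)

# With re.ASCII the same classes shrink to [0-9] and [a-zA-Z0-9_].
# The flag applies to the whole pattern, not to a single escape.
ascii_digits = re.findall(r"\d+", text, re.ASCII)
ascii_words = re.findall(r"\w+", text, re.ASCII)

# An explicit class narrows one part of a pattern without a global flag.
explicit_digits = re.findall(r"[0-9]+", text)

# \b follows the same definition of a word character: "é" counts as a
# word character by default, so there is no boundary after "caf".
boundary_unicode = re.findall(r"\bcaf\b", "caf\u00e9")
boundary_ascii = re.findall(r"\bcaf\b", "caf\u00e9", re.ASCII)

# \s matches the non-breaking space by default, but not under re.ASCII.
nbsp_unicode = re.findall(r"\s", "a\u00a0b")
nbsp_ascii = re.findall(r"\s", "a\u00a0b", re.ASCII)
```

The explicit ``[0-9]`` form is the alternative the revised text suggests when only one part of a pattern should be ASCII-only.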
@@ -471,16 +482,11 @@ form.
 matching instead of full Unicode matching. This is only meaningful for
 Unicode patterns, and is ignored for byte patterns.
-Note that the :const:`re.U` flag still exists (as well as its synonym
-:const:`re.UNICODE` and its embedded counterpart ``(?u)``), but it has
-become useless in Python 3.0.
-In previous Python versions, it was used to specify that
-matching had to be Unicode dependent (the default was ASCII matching in
-all circumstances). Starting from Python 3.0, the default is Unicode
-matching for Unicode strings (which can be changed by specifying the
-``'a'`` flag), and ASCII matching for 8-bit strings. Further, Unicode
-dependent matching for 8-bit strings isn't allowed anymore and results
-in a ValueError.
+Note that for backward compatibility, the :const:`re.U` flag still
+exists (as well as its synonym :const:`re.UNICODE` and its embedded
+counterpart ``(?u)``), but these are redundant in Python 3.0 since
+matches are Unicode by default for strings (and Unicode matching
+isn't allowed for bytes).
 .. data:: I
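
Note on the hunk above: the claims about ``re.U`` being redundant and Unicode matching being disallowed for bytes can be tried out with a short sketch. It is not part of the commit. The sample word and variable names are illustrative assumptions, and the behaviour assumes a current CPython 3.

```python
import re

word = "na\u00efve"  # "naïve", chosen only as an example

# re.UNICODE is still accepted for str patterns but changes nothing,
# because Unicode matching is already the default.
flag_is_redundant = (
    re.findall(r"\w+", word) == re.findall(r"\w+", word, re.UNICODE)
)

# Bytes patterns always use ASCII matching, so byte 0xEF is not a word
# character and splits the match.
bytes_words = re.findall(rb"\w+", b"na\xefve")

# Asking for Unicode matching on a bytes pattern raises ValueError.
try:
    re.compile(rb"\w+", re.UNICODE)
    unicode_on_bytes_error = None
except ValueError as exc:
    unicode_on_bytes_error = exc

# The inline flag (?a) is the embedded form of re.ASCII for str patterns.
inline_ascii_words = re.findall(r"(?a)\w+", word)
```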