Commit 216ad337 authored by Mark Summerfield

Added a note in each regarding the fact that unicode strings that look the same may not compare equal (due to the possibility of multiple representations).
parent 5c404aed
......@@ -107,7 +107,7 @@ the following functions:
based on the definition of canonical equivalence and compatibility equivalence.
In Unicode, several characters can be expressed in various ways. For example, the
character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
-the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
+the sequence U+0327 (COMBINING CEDILLA) U+0043 (LATIN CAPITAL LETTER C).
For each character, there are two normal forms: normal form C and normal form D.
Normal form D (NFD) is also known as canonical decomposition, and translates
......@@ -126,6 +126,10 @@ the following functions:
(NFKC) first applies the compatibility decomposition, followed by the canonical
composition.
+Even if two unicode strings are normalized and look the same to
+a human reader, if one has combining characters and the other
+doesn't, they may not compare equal.
.. versionadded:: 2.3
In addition, the module exposes the following constant:
......
......@@ -1040,7 +1040,7 @@ Comparison of objects of the same type depends on the type:
* Strings are compared lexicographically using the numeric equivalents (the
result of the built-in function :func:`ord`) of their characters. Unicode and
-8-bit strings are fully interoperable in this behavior.
+8-bit strings are fully interoperable in this behavior. [#]_
* Tuples and lists are compared lexicographically using comparison of
corresponding elements. This means that to compare equal, each element must
......@@ -1328,6 +1328,12 @@ groups from right to left).
cases, Python returns the latter result, in order to preserve that
``divmod(x,y)[0] * y + x % y`` be very close to ``x``.
+.. [#] While comparisons between unicode strings make sense at the byte
+level, they may be counter-intuitive to users. For example, the
+strings ``u"\u00C7"`` and ``u"\u0327\u0043"`` compare differently,
+even though they both represent the same unicode character (LATIN
+CAPITAL LETTER C WITH CEDILLA).
.. [#] The implementation computes this efficiently, without constructing lists or
sorting.
......
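The behaviour described in the added notes can be checked with a minimal sketch like the one below. It is written in Python 3 syntax rather than the ``u"..."`` literals used in the patch, and it assumes the decomposed form is spelled with the base letter first, followed by the combining mark (U+0043 U+0327). The helper name ``canonically_equal`` is hypothetical and not part of the ``unicodedata`` module.

```python
import unicodedata

# Precomposed form: a single code point.
composed = "\u00C7"          # LATIN CAPITAL LETTER C WITH CEDILLA
# Decomposed form: base letter followed by a combining mark (assumed order).
decomposed = "\u0043\u0327"  # LATIN CAPITAL LETTER C + COMBINING CEDILLA


def canonically_equal(a, b):
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Plain comparison works code point by code point, so the two differ.
print(composed == decomposed)

# Normalizing both sides to the same form makes them compare equal.
print(canonically_equal(composed, decomposed))

# NFD turns the precomposed character into the two-code-point sequence.
print(unicodedata.normalize("NFD", composed) == decomposed)
```

Normalizing both operands to the same form before comparing is one way to avoid the surprise; whether NFC or NFD is the better choice depends on the application.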