============
Unicode data
============

Django natively supports Unicode data everywhere. Providing your database can
somehow store the data, you can safely pass around Unicode strings to
templates, models and the database.

This document tells you what you need to know if you're writing applications
that use data or templates that are encoded in something other than ASCII.

Creating the database
=====================

Make sure your database is configured to be able to store arbitrary string
data. Normally, this means giving it an encoding of UTF-8 or UTF-16. If you use
a more restrictive encoding -- for example, latin1 (iso8859-1) -- you won't be
able to store certain characters in the database, and information will be lost.
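
As a rough illustration of that loss, the sketch below uses only Python's standard library (no database is involved) and assumes Python 3. The sample string is an arbitrary choice.

```python
text = "Привет"  # sample text that latin1 cannot represent

# UTF-8 can encode every Unicode character, so the round trip is lossless.
round_trip = text.encode("utf-8").decode("utf-8")

# latin1 has no code points for these characters, so encoding fails.
try:
    text.encode("latin-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# Forcing the conversion replaces each unsupported character with "?".
lossy = text.encode("latin-1", errors="replace").decode("latin-1")
```

A database restricted to latin1 faces the same choice: reject the value or store a degraded copy of it.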

* MySQL users, refer to the `MySQL manual`_ (section 9.1.3.2 for MySQL 5.1)
  for details on how to set or alter the database character set encoding.

* PostgreSQL users, refer to the `PostgreSQL manual`_ (section 22.3.2 in
  PostgreSQL 9) for details on creating databases with the correct encoding.

* SQLite users, there is nothing you need to do. SQLite always uses UTF-8
  for internal encoding.

.. _MySQL manual: http://dev.mysql.com/doc/refman/5.1/en/charset-database.html
.. _PostgreSQL manual: http://www.postgresql.org/docs/current/static/multibyte.html

All of Django's database backends automatically convert Unicode strings into
the appropriate encoding for talking to the database. They also automatically
convert strings retrieved from the database into Python Unicode strings. You
don't even need to tell Django what encoding your database uses: that is
handled transparently.

For more, see the section "The database API" below.

General string handling
=======================

Whenever you use strings with Django -- e.g., in database lookups, template
rendering or anywhere else -- you have two choices for encoding those strings.
You can use Unicode strings, or you can use normal strings (sometimes called
"bytestrings") that are encoded using UTF-8.

.. versionchanged:: 1.5

    In Python 3, the logic is reversed: normal strings are Unicode, and
    when you want to specifically create a bytestring, you have to prefix the
    string with a 'b'. As we are doing in Django code from version 1.5,
    we recommend that you import ``unicode_literals`` from the ``__future__``
    module in your code. Then, when you specifically want to create a
    bytestring literal, prefix the string with 'b'.

    Python 2 legacy::

        my_string = "This is a bytestring"
        my_unicode = u"This is a Unicode string"

    Python 2 with unicode literals or Python 3::

        from __future__ import unicode_literals

        my_string = b"This is a bytestring"
        my_unicode = "This is a Unicode string"

    See also :doc:`Python 3 compatibility </topics/python3>`.

.. warning::

    A bytestring does not carry any information with it about its encoding.
    For that reason, we have to make an assumption, and Django assumes that all
    bytestrings are in UTF-8.

    If you pass a string to Django that has been encoded in some other format,
    things will go wrong in interesting ways. Usually, Django will raise a
    ``UnicodeDecodeError`` at some point.

If your code only uses ASCII data, it's safe to use your normal strings,
passing them around at will, because ASCII is a subset of UTF-8.
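
The difference between ASCII-only bytestrings and bytestrings in some other encoding can be seen with plain Python. This is a minimal sketch assuming Python 3; it does not call Django, it only mimics the UTF-8 assumption described above.

```python
ascii_bytes = b"Paris"
latin1_bytes = "Orléans".encode("latin-1")  # b'Orl\xe9ans'

# ASCII is a subset of UTF-8, so ASCII-only bytes decode cleanly.
ascii_text = ascii_bytes.decode("utf-8")

# Bytes produced by another encoding are usually not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc
```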

Don't be fooled into thinking that if your :setting:`DEFAULT_CHARSET` setting is set
to something other than ``'utf-8'`` you can use that other encoding in your
bytestrings! :setting:`DEFAULT_CHARSET` only applies to the strings generated as
the result of template rendering (and email). Django will always assume UTF-8
encoding for internal bytestrings. The reason for this is that the
:setting:`DEFAULT_CHARSET` setting is not actually under your control (if you are the
application developer). It's under the control of the person installing and
using your application -- and if that person chooses a different setting, your
code must still continue to work. Ergo, it cannot rely on that setting.

In most cases when Django is dealing with strings, it will convert them to
Unicode strings before doing anything else. So, as a general rule, if you pass
in a bytestring, be prepared to receive a Unicode string back in the result.

Translated strings
------------------

Aside from Unicode strings and bytestrings, there's a third type of string-like
object you may encounter when using Django. The framework's
internationalization features introduce the concept of a "lazy translation" --
a string that has been marked as translated but whose actual translation result
isn't determined until the object is used in a string. This feature is useful
in cases where the translation locale is unknown until the string is used, even
though the string might have originally been created when the code was first
imported.

Normally, you won't have to worry about lazy translations. Just be aware that
if you examine an object and it claims to be a
``django.utils.functional.__proxy__`` object, it is a lazy translation.
Calling ``unicode()`` with the lazy translation as the argument will generate a
Unicode string in the current locale.
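
To make the idea of deferred evaluation concrete, here is a small stand-alone sketch. The ``LazyText`` class, the ``CATALOGUE`` dictionary and the ``state`` dictionary are hypothetical names invented for this illustration; they are not Django's implementation, and the example assumes Python 3, where ``str()`` plays the role that ``unicode()`` plays on Python 2.

```python
class LazyText(object):
    """Hypothetical stand-in for a lazy translation; not Django's class."""

    def __init__(self, func, *args):
        self._func = func
        self._args = args

    def __str__(self):
        # The lookup happens only when the object is used as a string.
        return self._func(*self._args)


# Hypothetical message catalogue and "current locale" holder.
CATALOGUE = {"en": {"greeting": "Hello"}, "fr": {"greeting": "Bonjour"}}
state = {"locale": "en"}


def translate(key):
    return CATALOGUE[state["locale"]][key]


# Created once, before the locale of any particular request is known.
greeting = LazyText(translate, "greeting")

state["locale"] = "fr"
french = str(greeting)   # evaluated now, using the current locale

state["locale"] = "en"
english = str(greeting)  # the same object yields a different result
```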

For more details about lazy translation objects, refer to the
:doc:`internationalization </topics/i18n/index>` documentation.

Useful utility functions
------------------------

Because some string operations come up again and again, Django ships with a few
useful functions that should make working with Unicode and bytestring objects
a bit easier.

Conversion functions
~~~~~~~~~~~~~~~~~~~~

The ``django.utils.encoding`` module contains a few functions that are handy
for converting back and forth between Unicode and bytestrings.

* ``smart_text(s, encoding='utf-8', strings_only=False, errors='strict')``
  converts its input to a Unicode string. The ``encoding`` parameter
  specifies the input encoding. (For example, Django uses this internally
  when processing form input data, which might not be UTF-8 encoded.) The
  ``strings_only`` parameter, if set to True, will result in Python
  numbers, booleans and ``None`` not being converted to a string (they keep
  their original types). The ``errors`` parameter takes any of the values
  that are accepted by Python's ``unicode()`` function for its error
  handling.

  If you pass ``smart_text()`` an object that has a ``__unicode__``
  method, it will use that method to do the conversion.

* ``force_text(s, encoding='utf-8', strings_only=False,
  errors='strict')`` is identical to ``smart_text()`` in almost all
  cases. The difference is when the first argument is a :ref:`lazy
  translation <lazy-translations>` instance. While ``smart_text()``
  preserves lazy translations, ``force_text()`` forces those objects to a
  Unicode string (causing the translation to occur). Normally, you'll want
  to use ``smart_text()``. However, ``force_text()`` is useful in
  template tags and filters that absolutely *must* have a string to work
  with, not just something that can be converted to a string.

* ``smart_bytes(s, encoding='utf-8', strings_only=False, errors='strict')``
  is essentially the opposite of ``smart_text()``. It forces the first
  argument to a bytestring. The ``strings_only`` parameter has the same
  behavior as for ``smart_text()`` and ``force_text()``. This is
  slightly different semantics from Python's builtin ``str()`` function,
  but the difference is needed in a few places within Django's internals.

Normally, you'll only need to use ``smart_text()``. Call it as early as
possible on any input data that might be either Unicode or a bytestring, and
from then on, you can treat the result as always being Unicode.
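
The "convert early" pattern can be sketched without Django. The ``to_text()`` helper below is a hypothetical, much-simplified stand-in for ``smart_text()``: it ignores lazy translations and the ``strings_only`` option, and it assumes Python 3.

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Hypothetical helper: return ``value`` as a Unicode string."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    return str(value)


# Input may arrive as a UTF-8 bytestring or as a Unicode string.
raw_inputs = [b"Orl\xc3\xa9ans", "Orléans"]

# Normalize once, at the boundary; the rest of the code sees only text.
normalized = [to_text(item) for item in raw_inputs]

# A known non-UTF-8 source can state its encoding explicitly.
from_latin1 = to_text(b"Orl\xe9ans", encoding="latin-1")
```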

.. _uri-and-iri-handling:

URI and IRI handling
~~~~~~~~~~~~~~~~~~~~

Web frameworks have to deal with URLs (which are a type of IRI_). One
requirement of URLs is that they are encoded using only ASCII characters.
However, in an international environment, you might need to construct a
URL from an IRI_ -- very loosely speaking, a URI_ that can contain Unicode
characters. Quoting and converting an IRI to URI can be a little tricky, so
Django provides some assistance.

* The function ``django.utils.encoding.iri_to_uri()`` implements the
  conversion from IRI to URI as required by the specification (:rfc:`3987`).

* The functions ``django.utils.http.urlquote()`` and
  ``django.utils.http.urlquote_plus()`` are versions of Python's standard
  ``urllib.quote()`` and ``urllib.quote_plus()`` that work with non-ASCII
  characters. (The data is converted to UTF-8 prior to encoding.)

These two groups of functions have slightly different purposes, and it's
important to keep them straight. Normally, you would use ``urlquote()`` on the
individual portions of the IRI or URI path so that any reserved characters
such as '&' or '%' are correctly encoded. Then, you apply ``iri_to_uri()`` to
the full IRI and it converts any non-ASCII characters to the correct encoded
values.

.. note::
    Technically, it isn't correct to say that ``iri_to_uri()`` implements the
    full algorithm in the IRI specification. It doesn't (yet) perform the
    international domain name encoding portion of the algorithm.

The ``iri_to_uri()`` function will not change ASCII characters that are
otherwise permitted in a URL. So, for example, the character '%' is not
further encoded when passed to ``iri_to_uri()``. This means you can pass a
full URL to this function and it will not mess up the query string or anything
like that.

An example might clarify things here::

    >>> urlquote(u'Paris & Orléans')
    u'Paris%20%26%20Orl%C3%A9ans'
    >>> iri_to_uri(u'/favorites/François/%s' % urlquote(u'Paris & Orléans'))
    '/favorites/Fran%C3%A7ois/Paris%20%26%20Orl%C3%A9ans'

If you look carefully, you can see that the portion that was generated by
``urlquote()`` in the second example was not double-quoted when passed to
``iri_to_uri()``. This is a very important and useful feature. It means that
you can construct your IRI without worrying about whether it contains
non-ASCII characters and then, right at the end, call ``iri_to_uri()`` on the
result.

The ``iri_to_uri()`` function is also idempotent, which means the following is
always true::

    iri_to_uri(iri_to_uri(some_string)) == iri_to_uri(some_string)

So you can safely call it multiple times on the same IRI without risking
double-quoting problems.
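
The reason this works can be sketched with Python 3's standard library alone. The ``sketch_iri_to_uri()`` function and the ``SAFE`` character set below are assumptions made for illustration, not Django's code: the point is only that ``%`` is treated as safe, so existing percent-escapes are left alone.

```python
from urllib.parse import quote

# Characters left untouched: URI delimiters plus "%", so that existing
# percent-escapes are not escaped a second time.
SAFE = "/#%[]=:;$&()+,!?*@'~"


def sketch_iri_to_uri(iri):
    """Hypothetical approximation: percent-encode non-ASCII as UTF-8."""
    return quote(iri, safe=SAFE)


# Quote one path segment fully, so "&" and spaces are escaped.
segment = quote("Paris & Orléans", safe="")

# Then convert the whole IRI; the already-quoted segment is unchanged.
uri = sketch_iri_to_uri("/favorites/François/%s" % segment)

# Applying the conversion again changes nothing.
uri_twice = sketch_iri_to_uri(uri)
```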

.. _URI: http://www.ietf.org/rfc/rfc2396.txt
.. _IRI: http://www.ietf.org/rfc/rfc3987.txt

Models
======

Because all strings are returned from the database as Unicode strings, model
fields that are character based (CharField, TextField, URLField, etc.) will
contain Unicode values when Django retrieves data from the database. This
is *always* the case, even if the data could fit into an ASCII bytestring.

You can pass in bytestrings when creating a model or populating a field, and
Django will convert it to Unicode when it needs to.

Choosing between ``__str__()`` and ``__unicode__()``
----------------------------------------------------

One consequence of using Unicode by default is that you have to take some care
when printing data from the model.

In particular, rather than giving your model a ``__str__()`` method, we
recommend you implement a ``__unicode__()`` method. In the ``__unicode__()``
method, you can quite safely return the values of all your fields without
having to worry about whether they fit into a bytestring or not. (The way
Python works, the result of ``__str__()`` is *always* a bytestring, even if you
accidentally try to return a Unicode object).

You can still create a ``__str__()`` method on your models if you want, of
course, but you shouldn't need to do this unless you have a good reason.
Django's ``Model`` base class automatically provides a ``__str__()``
implementation that calls ``__unicode__()`` and encodes the result into UTF-8.
This means you'll normally only need to implement a ``__unicode__()`` method
and let Django handle the coercion to a bytestring when required.

Taking care in ``get_absolute_url()``
-------------------------------------

URLs can only contain ASCII characters. If you're constructing a URL from
pieces of data that might be non-ASCII, be careful to encode the results in a
way that is suitable for a URL. The :func:`~django.core.urlresolvers.reverse`
function handles this for you automatically.

If you're constructing a URL manually (i.e., *not* using the ``reverse()``
function), you'll need to take care of the encoding yourself. In this case,
use the ``iri_to_uri()`` and ``urlquote()`` functions that were documented
above_. For example::

    from django.utils.encoding import iri_to_uri
    from django.utils.http import urlquote

    def get_absolute_url(self):
        url = u'/person/%s/?x=0&y=0' % urlquote(self.location)
        return iri_to_uri(url)

This function returns a correctly encoded URL even if ``self.location`` is
something like "Jack visited Paris & Orléans". (In fact, the ``iri_to_uri()``
call isn't strictly necessary in the above example, because all the
non-ASCII characters would have been removed in quoting in the first line.)

.. _above: `URI and IRI handling`_

The database API
================

You can pass either Unicode strings or UTF-8 bytestrings as arguments to
``filter()`` methods and the like in the database API. The following two
querysets are identical::

    from __future__ import unicode_literals

    qs = People.objects.filter(name__contains='Å')
    qs = People.objects.filter(name__contains=b'\xc3\x85') # UTF-8 encoding of Å

Templates
=========

You can use either Unicode or bytestrings when creating templates manually::

    from __future__ import unicode_literals
    from django.template import Template
    t1 = Template(b'This is a bytestring template.')
    t2 = Template('This is a Unicode template.')

But the common case is to read templates from the filesystem, and this creates
a slight complication: not all filesystems store their data encoded as UTF-8.
If your template files are not stored with a UTF-8 encoding, set the :setting:`FILE_CHARSET`
setting to the encoding of the files on disk. When Django reads in a template
file, it will convert the data from this encoding to Unicode. (:setting:`FILE_CHARSET`
is set to ``'utf-8'`` by default.)
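
The effect of reading a file with the wrong encoding can be reproduced with the standard library. This sketch assumes Python 3 and does not use Django's template loaders; the file name and template text are made up for the example.

```python
import io
import os
import tempfile

source = "Bienvenue à Orléans, {{ name }}!"

# Write a template file using latin1 instead of UTF-8.
path = os.path.join(tempfile.mkdtemp(), "welcome.html")
with io.open(path, "w", encoding="latin-1") as handle:
    handle.write(source)

# Reading with the encoding the file was actually saved in works.
with io.open(path, encoding="latin-1") as handle:
    loaded = handle.read()

# Reading it as UTF-8 fails, because the bytes are not valid UTF-8.
try:
    with io.open(path, encoding="utf-8") as handle:
        handle.read()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```

Setting :setting:`FILE_CHARSET` to match the files on disk corresponds to the first read above.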

The :setting:`DEFAULT_CHARSET` setting controls the encoding of rendered templates.
This is set to UTF-8 by default.

Template tags and filters
-------------------------

A couple of tips to remember when writing your own template tags and filters:

* Always return Unicode strings from a template tag's ``render()`` method
  and from template filters.

* Use ``force_text()`` in preference to ``smart_text()`` in these
  places. Tag rendering and filter calls occur as the template is being
  rendered, so there is no advantage to postponing the conversion of lazy
  translation objects into strings. It's easier to work solely with Unicode
  strings at that point.

Email
=====

Django's email framework (in ``django.core.mail``) supports Unicode
transparently. You can use Unicode data in the message bodies and any headers.
However, you're still obligated to respect the requirements of the email
specifications, so, for example, email addresses should use only ASCII
characters.

The following code example demonstrates that everything except email addresses
can be non-ASCII::

    from __future__ import unicode_literals
    from django.core.mail import EmailMessage

    subject = 'My visit to Sør-Trøndelag'
    sender = 'Arnbjörg Ráðormsdóttir <arnbjorg@example.com>'
    recipients = ['Fred <fred@example.com>']
    body = '...'
    msg = EmailMessage(subject, body, sender, recipients)
    msg.attach("Une pièce jointe.pdf", "%PDF-1.4.%...", mimetype="application/pdf")
    msg.send()
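
To see why the address itself must stay ASCII while names and subjects need not, it can help to look at how non-ASCII header text travels on the wire. The sketch below uses Python 3's standard ``email`` package directly rather than ``django.core.mail``, so it only approximates what happens when a message is sent.

```python
from email.header import Header, decode_header, make_header
from email.utils import formataddr

subject = "My visit to Sør-Trøndelag"

# Non-ASCII header text is wrapped in RFC 2047 "encoded words",
# which consist solely of ASCII characters.
encoded_subject = Header(subject, "utf-8").encode()

# Only the display name is encoded; the address is passed through as-is.
sender = formataddr(("Arnbjörg Ráðormsdóttir", "arnbjorg@example.com"))

# A receiving mail client reverses the encoding to recover the text.
decoded_subject = str(make_header(decode_header(encoded_subject)))
```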

Form submission
===============

HTML form submission is a tricky area. There's no guarantee that the
submission will include encoding information, which means the framework might
have to guess at the encoding of submitted data.

Django adopts a "lazy" approach to decoding form data. The data in an
``HttpRequest`` object is only decoded when you access it. In fact, most of
the data is not decoded at all. Only the ``HttpRequest.GET`` and
``HttpRequest.POST`` data structures have any decoding applied to them. Those
two fields will return their members as Unicode data. All other attributes and
methods of ``HttpRequest`` return data exactly as it was submitted by the
client.

By default, the :setting:`DEFAULT_CHARSET` setting is used as the assumed encoding
for form data. If you need to change this for a particular form, you can set
the ``encoding`` attribute on an ``HttpRequest`` instance. For example::

    def some_view(request):
        # We know that the data must be encoded as KOI8-R (for some reason).
        request.encoding = 'koi8-r'
        ...

You can even change the encoding after having accessed ``request.GET`` or
``request.POST``, and all subsequent accesses will use the new encoding.
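
Why the assumed encoding matters can be demonstrated with the standard library's query-string parser. This is a sketch assuming Python 3; it stands in for, and is not, Django's own request parsing, and the field name and value are invented for the example.

```python
from urllib.parse import parse_qs, quote

# Simulate a client that submits form data encoded as KOI8-R.
query = "name=" + quote("Привет", encoding="koi8-r")

# Decoding with the matching encoding recovers the text.
as_koi8 = parse_qs(query, encoding="koi8-r")

# Decoding the same bytes as UTF-8 yields replacement characters instead.
as_utf8 = parse_qs(query, encoding="utf-8")
```

The submitted bytes are identical in both calls; only the encoding used to interpret them differs.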

Most developers won't need to worry about changing form encoding, but this is
a useful feature for applications that talk to legacy systems whose encoding
you cannot control.

Django does not decode the data of file uploads, because that data is normally
treated as collections of bytes, rather than strings. Any automatic decoding
there would alter the meaning of the stream of bytes.