From http://mail.python.org/pipermail/i18n-sig/2003-April/001557.html

- Expose NullTranslations and GNUTranslations to __all__
- Set the default charset to iso-8859-1. It used to be None, which would cause problems with .ugettext() if the file had no charset parameter. Arguably, the po/mo file would be broken, but I still think iso-8859-1 is a reasonable default.
- Add a "coerce" default argument to GNUTranslations's constructor. The reason for this is that in Zope, we want all msgids and msgstrs to be Unicode. For the latter, we could use .ugettext() but there isn't currently a mechanism for Unicode-ifying msgids. The plan then is that the charset parameter specifies the encoding for both the msgids and msgstrs, and both are decoded to Unicode when read. For example, we might encode po files with utf-8. I think the GNU gettext tools don't care. Since this could potentially break code [*] that wants to use the encoded interface .gettext(), the constructor flag is added, defaulting to False. Most code I suspect will want to set this to True and use .ugettext().
- A few other minor changes from the Zope project, including asserting that a zero-length msgid must have a Project-ID-Version header for it to be counted as the metadata record.
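
To illustrate what the "coerce" flag described above is meant to do, here is a minimal standalone sketch in modern Python 3. It is not the code from this commit: `build_catalog` and its parameters are hypothetical names, and byte strings stand in for the encoded msgids and msgstrs read from a .mo file.

```python
# Hypothetical helper for illustration only; not part of the gettext module.
def build_catalog(raw_pairs, charset="iso-8859-1", coerce=False):
    """Build a msgid -> msgstr dict from (bytes, bytes) pairs.

    With coerce=True, both the keys (msgids) and the values (msgstrs)
    are decoded using `charset`. With coerce=False, both stay encoded.
    """
    catalog = {}
    for msgid, msgstr in raw_pairs:
        if coerce:
            msgid = msgid.decode(charset)
            msgstr = msgstr.decode(charset)
        catalog[msgid] = msgstr
    return catalog


# A catalog whose file charset is assumed to be utf-8.
raw = [("café".encode("utf-8"), "coffee shop".encode("utf-8"))]

encoded = build_catalog(raw, charset="utf-8")               # keys and values stay bytes
decoded = build_catalog(raw, charset="utf-8", coerce=True)  # keys and values become str
```

With coercion on, a lookup has to use a decoded msgid as its key, which is why the message recommends pairing the flag with .ugettext().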

Commit a1ce93f8 authored Apr 11, 2003 by Barry Warsaw

Showing with 58 additions and 21 deletions

libgettext.tex Doc/lib/libgettext.tex +28 -8

gettext.py Lib/gettext.py +30 -13

test_gettext.py Lib/test/test_gettext.py +0 -0

--- a/Doc/lib/libgettext.tex
+++ b/Doc/lib/libgettext.tex
@@ -285,13 +285,17 @@ The \module{gettext} module provides one additional class derived from
 \class{NullTranslations}: \class{GNUTranslations}.  This class
 overrides \method{_parse()} to enable reading GNU \program{gettext}
 format \file{.mo} files in both big-endian and little-endian format.
-
-It also parses optional meta-data out of the translation catalog.  It
-is convention with GNU \program{gettext} to include meta-data as the
-translation for the empty string.  This meta-data is in \rfc{822}-style
-\code{key: value} pairs.  If the key \code{Content-Type} is found,
-then the \code{charset} property is used to initialize the
-``protected'' \member{_charset} instance variable.  The entire set of
+It also adds the ability to coerce both message ids and message
+strings to Unicode.
+
+\class{GNUTranslations} parses optional meta-data out of the
+translation catalog.  It is convention with GNU \program{gettext} to
+include meta-data as the translation for the empty string.  This
+meta-data is in \rfc{822}-style \code{key: value} pairs, and must
+contain the \code{Project-Id-Version}.  If the key
+\code{Content-Type} is found, then the \code{charset} property is used
+to initialize the ``protected'' \member{_charset} instance variable,
+defaulting to \code{iso-8859-1} if not found.  The entire set of
 key/value pairs are placed into a dictionary and set as the
 ``protected'' \member{_info} instance variable.

@@ -302,11 +306,27 @@ can raise \exception{IOError}.
 The other usefully overridden method is \method{ugettext()}, which
 returns a Unicode string by passing both the translated message string
 and the value of the ``protected'' \member{_charset} variable to the
-builtin \function{unicode()} function.
+builtin \function{unicode()} function.  Note that if you use
+\method{ugettext()} you probably also want your message ids to be
+Unicode.  To do this, set the variable \var{coerce} to \code{True} in
+the \class{GNUTranslations} constructor.  This ensures that both the
+message ids and message strings are decoded to Unicode when the file
+is read, using the file's \code{charset} value.  If you do this, you
+will not want to use the \method{gettext()} method -- always use
+\method{ugettext()} instead.

 To facilitate plural forms, the methods \method{ngettext} and
 \method{ungettext} are overridden as well.

+\begin{methoddesc}[GNUTranslations]{__init__}{
+    \optional{fp\optional{, coerce}}}
+Constructs and parses a translation catalog in GNU gettext format.
+\var{fp} is passed to the base class (\class{NullTranslations})
+constructor.  \var{coerce} is a flag specifying whether message ids
+and message strings should be converted to Unicode when the file is
+parsed.  It defaults to \code{False} for backward compatibility.
+\end{methoddesc}
+
 \subsubsection{Solaris message catalog support}

 The Solaris operating system defines its own binary
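
The metadata rules in the documentation hunk above (key: value pairs in the translation of the empty string, a required Project-Id-Version header, and a charset taken from Content-Type with iso-8859-1 as the default) can be sketched roughly as follows. This is a standalone Python 3 illustration under those assumptions; `parse_metadata` is a hypothetical name, not a function of the gettext module.

```python
# Hypothetical helper for illustration only; not part of the gettext module.
def parse_metadata(tmsg, default_charset="iso-8859-1"):
    """Return (info, charset) parsed from the empty-msgid translation.

    The text only counts as metadata if it starts with a
    Project-Id-Version header; otherwise nothing is parsed and the
    default charset is kept.
    """
    info = {}
    charset = default_charset
    if not tmsg.lower().startswith("project-id-version:"):
        return info, charset
    for item in tmsg.splitlines():
        item = item.strip()
        if not item or ":" not in item:
            continue
        key, value = item.split(":", 1)
        key = key.strip().lower()
        value = value.strip()
        info[key] = value
        if key == "content-type":
            # Value looks like "text/plain; charset=utf-8".
            parts = value.split("charset=")
            if len(parts) > 1:
                charset = parts[1]
    return info, charset


header = (
    "Project-Id-Version: demo 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
)
info, charset = parse_metadata(header)

# No Project-Id-Version header, so this is not treated as metadata.
no_info, fallback_charset = parse_metadata("Content-Type: text/plain; charset=utf-8\n")
```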

--- a/Lib/gettext.py
+++ b/Lib/gettext.py
@@ -50,8 +50,10 @@ import copy, os, re, struct, sys
 from errno import ENOENT


-__all__ = ["bindtextdomain","textdomain","gettext","dgettext",
-           "find","translation","install","Catalog"]
+__all__ = ['NullTranslations', 'GNUTranslations', 'Catalog',
+           'find', 'translation', 'install', 'textdomain', 'bindtextdomain',
+           'dgettext', 'dngettext', 'gettext', 'ngettext',
+           ]

 _default_localedir = os.path.join(sys.prefix, 'share', 'locale')

@@ -170,7 +172,7 @@ def _expand_lang(locale):
 class NullTranslations:
    def __init__(self, fp=None):
        self._info = {}
-        self._charset = None
+        self._charset = 'iso-8859-1'
        self._fallback = None
        if fp is not None:
            self._parse(fp)
@@ -226,6 +228,12 @@ class GNUTranslations(NullTranslations):
    LE_MAGIC = 0x950412deL
    BE_MAGIC = 0xde120495L

+    def __init__(self, fp=None, coerce=False):
+        # Set this attribute before calling the base class constructor, since
+        # the latter calls _parse() which depends on self._coerce.
+        self._coerce = coerce
+        NullTranslations.__init__(self, fp)
+
    def _parse(self, fp):
        """Override this method to support alternative .mo formats."""
        unpack = struct.unpack
@@ -260,16 +268,22 @@ class GNUTranslations(NullTranslations):
                    # Plural forms
                    msgid1, msgid2 = msg.split('\x00')
                    tmsg = tmsg.split('\x00')
+                    if self._coerce:
+                        msgid1 = unicode(msgid1, self._charset)
+                        tmsg = [unicode(x, self._charset) for x in tmsg]
                    for i in range(len(tmsg)):
                        catalog[(msgid1, i)] = tmsg[i]
                else:
+                    if self._coerce:
+                        msg = unicode(msg, self._charset)
+                        tmsg = unicode(tmsg, self._charset)
                    catalog[msg] = tmsg
            else:
                raise IOError(0, 'File is corrupt', filename)
            # See if we're looking at GNU .mo conventions for metadata
-            if mlen == 0:
+            if mlen == 0 and tmsg.lower().startswith('project-id-version:'):
                # Catalog description
-                for item in tmsg.split('\n'):
+                for item in tmsg.splitlines():
                    item = item.strip()
                    if not item:
                        continue
@@ -297,7 +311,6 @@ class GNUTranslations(NullTranslations):
                return self._fallback.gettext(message)
            return message

-
    def ngettext(self, msgid1, msgid2, n):
        try:
            return self._catalog[(msgid1, self.plural(n))]
@@ -309,16 +322,17 @@ class GNUTranslations(NullTranslations):
            else:
                return msgid2

-
    def ugettext(self, message):
-        try:
-            tmsg = self._catalog[message]
-        except KeyError:
+        missing = object()
+        tmsg = self._catalog.get(message, missing)
+        if tmsg is missing:
            if self._fallback:
                return self._fallback.ugettext(message)
            tmsg = message
-        return unicode(tmsg, self._charset)
-
+        if not self._coerce:
+            return unicode(tmsg, self._charset)
+        # The msgstr is already coerced to Unicode
+        return tmsg

    def ungettext(self, msgid1, msgid2, n):
        try:
@@ -330,7 +344,10 @@ class GNUTranslations(NullTranslations):
                tmsg = msgid1
            else:
                tmsg = msgid2
-        return unicode(tmsg, self._charset)
+        if not self._coerce:
+            return unicode(tmsg, self._charset)
+        # The msgstr is already coerced to Unicode
+        return tmsg


 # Locate a .mo file using the gettext strategy
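
The `ugettext()` hunk above swaps a try/except KeyError for a lookup with a unique sentinel object. A minimal Python 3 sketch of that lookup pattern follows; `lookup` and `_MISSING` are hypothetical names, and the fallback is modelled as a plain callable rather than a translations object.

```python
# Hypothetical sketch of the sentinel lookup pattern; not the module's code.
_MISSING = object()  # unique marker that can never be a stored translation


def lookup(catalog, message, fallback=None):
    """Return the translation for `message`, or a fallback result,
    or the untranslated message itself."""
    tmsg = catalog.get(message, _MISSING)
    if tmsg is _MISSING:
        if fallback is not None:
            return fallback(message)
        return message
    return tmsg


translated = lookup({"hello": "bonjour"}, "hello")
untranslated = lookup({}, "hello")
via_fallback = lookup({}, "hello", fallback=str.upper)
empty_value = lookup({"hello": ""}, "hello")  # an empty string still counts as found
```

One plausible benefit of the sentinel is that an identity check tells a missing key apart from any value the catalog might hold, including an empty string or None.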

--- a/Lib/test/test_gettext.py
+++ b/Lib/test/test_gettext.py