"""RFC 2822 message manipulation.

Note: This is only a very rough sketch of a full RFC-822 parser; in particular
the tokenizing of addresses does not adhere to all the quoting rules.

Note: RFC 2822 is a long-awaited update to RFC 822.  This module should
conform to RFC 2822, and is thus mis-named (it's not worth renaming it).  Some
effort at RFC 2822 updates has been made, but a thorough audit has not been
performed.  Consider any RFC 2822 non-conformance to be a bug.

    RFC 2822: http://www.faqs.org/rfcs/rfc2822.html
    RFC 822 : http://www.faqs.org/rfcs/rfc822.html (obsolete)

Directions for use:

To create a Message object: first open a file, e.g.:

  fp = open(file, 'r')

You can use any other legal way of getting an open file object, e.g. use
sys.stdin or call os.popen().  Then pass the open file object to the Message()
constructor:

  m = Message(fp)

This class can work with any input object that supports a readline method.  If
the input object has seek and tell capability, the rewindbody method will
work; also illegal lines will be pushed back onto the input stream.  If the
input object lacks seek but has an `unread' method that can push back a line
of input, Message will use that to push back illegal lines.  Thus this class
can be used to parse messages coming from a buffered stream.
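
A minimal sketch of such a pushback wrapper, written for Python 3.  The class
name PushbackReader is hypothetical and not part of this module; it only
assumes that the wrapped object has a readline() method:

```python
import io


class PushbackReader:
    # Hypothetical wrapper (not part of this module): adds unread() to any
    # object that has readline().
    def __init__(self, fp):
        self.fp = fp
        self.pushed = []        # lines pushed back, most recent last

    def readline(self):
        # Serve pushed-back lines before touching the underlying stream.
        if self.pushed:
            return self.pushed.pop()
        return self.fp.readline()

    def unread(self, line):
        self.pushed.append(line)


reader = PushbackReader(io.StringIO('Subject: hi\nnot a header\n'))
first = reader.readline()
second = reader.readline()
reader.unread(second)           # push the offending line back
again = reader.readline()       # the same line is returned again
```

An object shaped like this offers the readline()/unread() pair that Message
looks for when the input cannot seek.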

The optional `seekable' argument is provided as a workaround for certain stdio
libraries in which tell() discards buffered data before discovering that the
lseek() system call doesn't work.  For maximum portability, you should set the
seekable argument to zero to prevent that initial tell() when passing in
an unseekable object such as a file object created from a socket object.  If
it is 1 on entry -- which it is by default -- the tell() method of the open
file object is called once; if this raises an exception, seekable is reset to
0.  For other nonzero values of seekable, this test is not made.
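
The default probe can be sketched as follows, assuming Python 3 (where IOError
is an alias of OSError).  The names probe_seekable and LineSource are
hypothetical and used only for this illustration:

```python
import io


def probe_seekable(fp):
    # Hypothetical helper mirroring the check described above: call tell()
    # once and treat a missing method or an I/O error as "not seekable".
    try:
        fp.tell()
    except (AttributeError, OSError):
        return 0
    return 1


class LineSource:
    # Stand-in for an unseekable stream: readline() only, no tell()/seek().
    def __init__(self, lines):
        self.lines = list(lines)

    def readline(self):
        return self.lines.pop(0) if self.lines else ''


seekable_result = probe_seekable(io.StringIO('Subject: hi\n\n'))
unseekable_result = probe_seekable(LineSource(['Subject: hi\n', '\n']))
```

Passing seekable=0 explicitly skips this probe altogether, which is the
portable choice for socket-backed file objects.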

To get the text of a particular header there are several methods:

  str = m.getheader(name)
  str = m.getrawheader(name)

where name is the name of the header, e.g. 'Subject'.  The difference is that
getheader() strips the leading and trailing whitespace, while getrawheader()
doesn't.  Both functions retain embedded whitespace (including newlines)
exactly as they are specified in the header, and leave the case of the text
unchanged.
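
The whitespace difference can be illustrated with a small standalone sketch
(Python 3).  The helpers raw_value and stripped_value are hypothetical and
work on a plain list of header lines; they show only the leading/trailing
whitespace behaviour described above and are not this module's
implementation, which may differ in detail:

```python
def raw_value(lines, name):
    # Text after 'Name:' of the first matching header, continuation lines
    # included, whitespace untouched.  Returns None if the header is absent.
    prefix = name.lower() + ':'
    found = []
    for line in lines:
        if found:
            if not line[:1].isspace():
                break
            found.append(line)
        elif line[:len(prefix)].lower() == prefix:
            found.append(line[len(prefix):])
    return ''.join(found) if found else None


def stripped_value(lines, name):
    # Same text with leading and trailing whitespace removed.
    value = raw_value(lines, name)
    return None if value is None else value.strip()


headers = ['Subject: hello\n', '  world \n', 'To: a@example.org\n']
raw = raw_value(headers, 'subject')
stripped = stripped_value(headers, 'subject')
```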

For addresses and address lists there are functions

  realname, mailaddress = m.getaddr(name)
  list = m.getaddrlist(name)

where the latter returns a list of (realname, mailaddr) tuples.
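
For comparison, the email.utils module in the Python 3 standard library
provides parseaddr() and getaddresses(), which return the same
(realname, mailaddr) tuple shape.  This sketch uses those functions rather
than this module, and the example.org/example.com addresses are made up:

```python
from email.utils import getaddresses, parseaddr

# One address: a (realname, mailaddr) tuple.
single = parseaddr('Guido van Rossum <guido@cwi.nl>')

# Several addresses: a list of such tuples; a bare address has an empty name.
several = getaddresses(['Ann Example <ann@example.org>, bob@example.com'])
```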

There is also a method

  time = m.getdate(name)

which parses a Date-like field and returns a time-compatible tuple,
i.e. a tuple such as returned by time.localtime() or accepted by
time.mktime().
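
The shape of these tuples can be seen with the same-named functions in the
Python 3 standard library's email.utils (parsedate, parsedate_tz, mktime_tz).
This sketch uses those functions rather than this module, and the timestamp
is an arbitrary example:

```python
import calendar
from email.utils import mktime_tz, parsedate, parsedate_tz

stamp = 'Mon, 20 Nov 1995 19:12:08 -0500'

local_fields = parsedate(stamp)       # 9-tuple; the zone offset is dropped
zoned_fields = parsedate_tz(stamp)    # 10-tuple; last item is the offset in seconds
utc_seconds = mktime_tz(zoned_fields) # seconds since the epoch, zone applied

# The same instant written directly in UTC, for comparison.
utc_check = calendar.timegm((1995, 11, 21, 0, 12, 8, 0, 0, 0))
```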

See the class definition for lower level access methods.

There are also some utility functions here.
"""
# Cleanup and extensions by Eric S. Raymond <esr@thyrsus.com>

import time

__all__ = ["Message","AddressList","parsedate","parsedate_tz","mktime_tz"]

_blanklines = ('\r\n', '\n')            # Optimization for islast()


class Message:
    """Represents a single RFC 2822-compliant message."""

    def __init__(self, fp, seekable = 1):
        """Initialize the class instance and read the headers."""
        if seekable == 1:
            # Exercise tell() to make sure it works
            # (and then assume seek() works, too)
            try:
                fp.tell()
            except (AttributeError, IOError):
                seekable = 0
            else:
                seekable = 1
        self.fp = fp
        self.seekable = seekable
        self.startofheaders = None
        self.startofbody = None
        #
        if self.seekable:
            try:
                self.startofheaders = self.fp.tell()
            except IOError:
                self.seekable = 0
        #
        self.readheaders()
        #
        if self.seekable:
            try:
                self.startofbody = self.fp.tell()
            except IOError:
                self.seekable = 0

    def rewindbody(self):
        """Rewind the file to the start of the body (if seekable)."""
        if not self.seekable:
            raise IOError("unseekable file")
        self.fp.seek(self.startofbody)

    def readheaders(self):
        """Read header lines.

        Read header lines up to the entirely blank line that terminates them.
        The (normally blank) line that ends the headers is consumed but not
        included in self.headers.  If a non-header line ends the headers
        (which is an error), an attempt is made to push it back onto the
        input; it is never included in self.headers.

        The variable self.status is set to the empty string if all went well,
        otherwise it is an error message.  The variable self.headers is a
        completely uninterpreted list of lines contained in the header (so
        printing them will reproduce the header exactly as it appears in the
        file).
        """
        self.dict = {}
        self.unixfrom = ''
        self.headers = list = []
        self.status = ''
        headerseen = ""
        firstline = 1
        startofline = unread = tell = None
        if hasattr(self.fp, 'unread'):
            unread = self.fp.unread
        elif self.seekable:
            tell = self.fp.tell
        while 1:
            if tell:
                try:
                    startofline = tell()
                except IOError:
                    startofline = tell = None
                    self.seekable = 0
            line = self.fp.readline()
            if not line:
                self.status = 'EOF in headers'
                break
            # Skip unix From name time lines
            if firstline and line.startswith('From '):
                self.unixfrom = self.unixfrom + line
                continue
            firstline = 0
            if headerseen and line[0] in ' \t':
                # It's a continuation line.
                list.append(line)
                x = (self.dict[headerseen] + "\n " + line.strip())
                self.dict[headerseen] = x.strip()
                continue
            elif self.iscomment(line):
                # It's a comment.  Ignore it.
                continue
            elif self.islast(line):
                # Note! No pushback here!  The delimiter line gets eaten.
                break
            headerseen = self.isheader(line)
            if headerseen:
                # It's a legal header line, save it.
                list.append(line)
                self.dict[headerseen] = line[len(headerseen)+1:].strip()
                continue
            else:
                # It's not a header line; throw it back and stop here.
                if not self.dict:
                    self.status = 'No headers'
                else:
                    self.status = 'Non-header line where header expected'
                # Try to undo the read.
                if unread:
                    unread(line)
                elif tell:
                    self.fp.seek(startofline)
                else:
                    self.status = self.status + '; bad seek'
                break

    def isheader(self, line):
        """Determine whether a given line is a legal header.

        This method should return the header name, suitably canonicalized.
        You may override this method in order to use Message parsing on tagged
        data in RFC 2822-like formats with special header formats.
        """
        i = line.find(':')
        if i > 0:
            return line[:i].lower()
        else:
            return None

    def islast(self, line):
        """Determine whether a line is a legal end of RFC 2822 headers.

        You may override this method if your application wants to bend the
        rules, e.g. to strip trailing whitespace, or to recognize MH template
        separators ('--------').  For convenience (e.g. for code reading from
        sockets) a line consisting of \r\n also matches.
        """
        return line in _blanklines

    def iscomment(self, line):
        """Determine whether a line should be skipped entirely.

        You may override this method in order to use Message parsing on tagged
        data in RFC 2822-like formats that support embedded comments or
        free-text data.
        """
        return False

    def getallmatchingheaders(self, name):
        """Find all header lines matching a given header name.

        Look through the list of headers and find all lines matching a given
        header name (and their continuation lines).  A list of the lines is
        returned, without interpretation.  If the header does not occur, an
        empty list is returned.  If the header occurs multiple times, all
        occurrences are returned.  Case is not important in the header name.
        """
        name = name.lower() + ':'
        n = len(name)
        list = []
        hit = 0
        for line in self.headers:
            if line[:n].lower() == name:
                hit = 1
            elif not line[:1].isspace():
                hit = 0
            if hit:
                list.append(line)
        return list

    def getfirstmatchingheader(self, name):
        """Get the first header line matching name.

        This is similar to getallmatchingheaders, but it returns only the
        first matching header (and its continuation lines).
        """
        name = name.lower() + ':'
        n = len(name)
        list = []
        hit = 0
        for line in self.headers:
            if hit:
                if not line[:1].isspace():
                    break
            elif line[:n].lower() == name:
                hit = 1
            if hit:
                list.append(line)
        return list

    def getrawheader(self, name):
        """A higher-level interface to getfirstmatchingheader().

        Return a string containing the literal text of the header but with the
        keyword stripped.  All leading, trailing and embedded whitespace is
        kept in the string, however.  Return None if the header does not
        occur.
        """

        list = self.getfirstmatchingheader(name)
        if not list:
            return None
        list[0] = list[0][len(name) + 1:]
        return ''.join(list)

    def getheader(self, name, default=None):
        """Get the header value for a name.

        This is the normal interface: it returns a stripped version of the
        header value for a given header name, or default (None unless
        specified) if the header doesn't exist.
        This uses the dictionary version which finds the *last* such header.
        """
        try:
            return self.dict[name.lower()]
        except KeyError:
            return default
    get = getheader

    def getheaders(self, name):
        """Get all values for a header.

        This returns a list of values for headers given more than once; each
        value in the result list is stripped in the same way as the result of
        getheader().  If the header is not given, return an empty list.
        """
        result = []
        current = ''
        have_header = 0
        for s in self.getallmatchingheaders(name):
            if s[0].isspace():
                if current:
                    current = "%s\n %s" % (current, s.strip())
                else:
                    current = s.strip()
            else:
                if have_header:
                    result.append(current)
                current = s[s.find(":") + 1:].strip()
                have_header = 1
        if have_header:
            result.append(current)
        return result

    def getaddr(self, name):
        """Get a single address from a header, as a tuple.

        An example return value:
        ('Guido van Rossum', 'guido@cwi.nl')
        """
        # New, by Ben Escoto
        alist = self.getaddrlist(name)
        if alist:
            return alist[0]
        else:
            return (None, None)

    def getaddrlist(self, name):
        """Get a list of addresses from a header.

        Retrieves a list of addresses from a header, where each address is a
        tuple as returned by getaddr().  Scans all named headers, so it works
        properly with multiple To: or Cc: headers for example.
        """
        raw = []
        for h in self.getallmatchingheaders(name):
            if h[0] in ' \t':
                raw.append(h)
            else:
                if raw:
                    raw.append(', ')
                i = h.find(':')
                if i > 0:
                    # Append only when a colon was found, so that no stale or
                    # unbound value is ever added.
                    raw.append(h[i+1:])
        alladdrs = ''.join(raw)
        a = AddressList(alladdrs)
        return a.addresslist

    def getdate(self, name):
        """Retrieve a date field from a header.

        Retrieves a date field from the named header, returning a tuple
        compatible with time.mktime().
        """
        try:
            data = self[name]
        except KeyError:
            return None
        return parsedate(data)

    def getdate_tz(self, name):
        """Retrieve a date field from a header as a 10-tuple.

        The first 9 elements make up a tuple compatible with time.mktime(),
        and the 10th is the offset of the poster's time zone from GMT/UTC.
        """
        try:
            data = self[name]
        except KeyError:
            return None
        return parsedate_tz(data)


    # Access as a dictionary (only finds *last* header of each type):

    def __len__(self):
        """Get the number of headers in a message."""
        return len(self.dict)

    def __getitem__(self, name):
        """Get a specific header, as from a dictionary."""
        return self.dict[name.lower()]

    def __setitem__(self, name, value):
        """Set the value of a header.

        Note: This is not a perfect inversion of __getitem__, because any
        changed headers get stuck at the end of the raw-headers list rather
        than where the altered header was.
        """
        del self[name] # Won't fail if it doesn't exist
        self.dict[name.lower()] = value
        text = name + ": " + value
        lines = text.split("\n")
        for line in lines:
            self.headers.append(line + "\n")

    def __delitem__(self, name):
        """Delete all occurrences of a specific header, if it is present."""
        name = name.lower()
        if not name in self.dict:
            return
        del self.dict[name]
        name = name + ':'
        n = len(name)
        list = []
        hit = 0
        for i in range(len(self.headers)):
            line = self.headers[i]
            if line[:n].lower() == name:
                hit = 1
            elif not line[:1].isspace():
                hit = 0
            if hit:
                list.append(i)
        list.reverse()
        for i in list:
            del self.headers[i]

    def setdefault(self, name, default=""):
        lowername = name.lower()
        if lowername in self.dict:
            return self.dict[lowername]
        else:
            text = name + ": " + default
            lines = text.split("\n")
            for line in lines:
                self.headers.append(line + "\n")
            self.dict[lowername] = default
            return default

    def has_key(self, name):
        """Determine whether a message contains the named header."""
        return name.lower() in self.dict

    def __contains__(self, name):
        """Determine whether a message contains the named header."""
        return name.lower() in self.dict

    def keys(self):
        """Get all of a message's header field names."""
        return self.dict.keys()

    def values(self):
        """Get all of a message's header field values."""
        return self.dict.values()

    def items(self):
        """Get all of a message's headers.

        Returns a list of name, value tuples.
        """
        return self.dict.items()

    def __str__(self):
        str = ''
        for hdr in self.headers:
            str = str + hdr
        return str


# Utility functions
# -----------------

# XXX Should fix unquote() and quote() to be really conformant.
# XXX The inverses of the parse functions may also be useful.


def unquote(str):
    """Remove quotes from a string."""
    if len(str) > 1:
        if str[0] == '"' and str[-1:] == '"':
            return str[1:-1]
        if str[0] == '<' and str[-1:] == '>':
            return str[1:-1]
    return str


def quote(str):
    """Backslash-escape backslashes and double quotes in a string."""
    return str.replace('\\', '\\\\').replace('"', '\\"')


def parseaddr(address):
    """Parse an address into a (realname, mailaddr) tuple."""
    a = AddressList(address)
    list = a.addresslist
    if not list:
        return (None, None)
    else:
        return list[0]


class AddrlistClass:
    """Address parser class by Ben Escoto.

    To understand what this class does, it helps to have a copy of
    RFC 2822 in front of you.

    http://www.faqs.org/rfcs/rfc2822.html

    Note: this class interface is deprecated and may be removed in the future.
    Use rfc822.AddressList instead.
    """

    def __init__(self, field):
        """Initialize a new instance.

        `field' is an unparsed address header field, containing one or more
        addresses.
        """
        self.specials = '()<>@,:;.\"[]'
        self.pos = 0
        self.LWS = ' \t'
        self.CR = '\r\n'
        self.atomends = self.specials + self.LWS + self.CR
        # Note that RFC 2822 now specifies `.' as obs-phrase, meaning that it
        # is obsolete syntax.  RFC 2822 requires that we recognize obsolete
        # syntax, so allow dots in phrases.
        self.phraseends = self.atomends.replace('.', '')
        self.field = field
        self.commentlist = []

    def gotonext(self):
        """Parse up to the start of the next address."""
        while self.pos < len(self.field):
            if self.field[self.pos] in self.LWS + '\n\r':
                self.pos = self.pos + 1
            elif self.field[self.pos] == '(':
                self.commentlist.append(self.getcomment())
            else: break

    def getaddrlist(self):
        """Parse all addresses.

        Returns a list containing all of the addresses.
        """
        result = []
        while 1:
            ad = self.getaddress()
            if ad:
                result += ad
            else:
                break
        return result

    def getaddress(self):
        """Parse the next address."""
        self.commentlist = []
        self.gotonext()

        oldpos = self.pos
        oldcl = self.commentlist
        plist = self.getphraselist()

        self.gotonext()
        returnlist = []

        if self.pos >= len(self.field):
            # Bad email address technically, no domain.
            if plist:
                returnlist = [(' '.join(self.commentlist), plist[0])]

        elif self.field[self.pos] in '.@':
            # email address is just an addrspec
            # this isn't very efficient since we start over
            self.pos = oldpos
            self.commentlist = oldcl
            addrspec = self.getaddrspec()
            returnlist = [(' '.join(self.commentlist), addrspec)]

        elif self.field[self.pos] == ':':
            # address is a group
            returnlist = []

            fieldlen = len(self.field)
            self.pos = self.pos + 1
            while self.pos < len(self.field):
                self.gotonext()
                if self.pos < fieldlen and self.field[self.pos] == ';':
                    self.pos = self.pos + 1
                    break
                returnlist = returnlist + self.getaddress()

        elif self.field[self.pos] == '<':
            # Address is a phrase then a route addr
            routeaddr = self.getrouteaddr()

            if self.commentlist:
                returnlist = [(' '.join(plist) + ' (' + \
                         ' '.join(self.commentlist) + ')', routeaddr)]
            else: returnlist = [(' '.join(plist), routeaddr)]

        else:
            if plist:
                returnlist = [(' '.join(self.commentlist), plist[0])]
            elif self.field[self.pos] in self.specials:
                self.pos = self.pos + 1

        self.gotonext()
        if self.pos < len(self.field) and self.field[self.pos] == ',':
            self.pos = self.pos + 1
        return returnlist

    def getrouteaddr(self):
        """Parse a route address (Return-path value).

        This method just skips all the route stuff and returns the addrspec.
        """
        if self.field[self.pos] != '<':
            return

        expectroute = 0
        self.pos = self.pos + 1
        self.gotonext()
        adlist = ""
        while self.pos < len(self.field):
            if expectroute:
                self.getdomain()
                expectroute = 0
            elif self.field[self.pos] == '>':
                self.pos = self.pos + 1
                break
            elif self.field[self.pos] == '@':
                self.pos = self.pos + 1
                expectroute = 1
            elif self.field[self.pos] == ':':
                self.pos = self.pos + 1
            else:
                adlist = self.getaddrspec()
                self.pos = self.pos + 1
                break
            self.gotonext()

        return adlist

    def getaddrspec(self):
        """Parse an RFC 2822 addr-spec."""
        aslist = []

        self.gotonext()
        while self.pos < len(self.field):
            if self.field[self.pos] == '.':
                aslist.append('.')
                self.pos = self.pos + 1
            elif self.field[self.pos] == '"':
                aslist.append('"%s"' % self.getquote())
            elif self.field[self.pos] in self.atomends:
                break
            else: aslist.append(self.getatom())
            self.gotonext()

        if self.pos >= len(self.field) or self.field[self.pos] != '@':
            return ''.join(aslist)

        aslist.append('@')
        self.pos = self.pos + 1
        self.gotonext()
        return ''.join(aslist) + self.getdomain()

    def getdomain(self):
        """Get the complete domain name from an address."""
        sdlist = []
        while self.pos < len(self.field):
            if self.field[self.pos] in self.LWS:
                self.pos = self.pos + 1
            elif self.field[self.pos] == '(':
                self.commentlist.append(self.getcomment())
            elif self.field[self.pos] == '[':
                sdlist.append(self.getdomainliteral())
            elif self.field[self.pos] == '.':
                self.pos = self.pos + 1
                sdlist.append('.')
            elif self.field[self.pos] in self.atomends:
                break
            else: sdlist.append(self.getatom())
        return ''.join(sdlist)

    def getdelimited(self, beginchar, endchars, allowcomments = 1):
        """Parse a header fragment delimited by special characters.

        `beginchar' is the start character for the fragment.  If self is not
        looking at an instance of `beginchar' then getdelimited returns the
        empty string.

        `endchars' is a sequence of allowable end-delimiting characters.
        Parsing stops when one of these is encountered.

        If `allowcomments' is non-zero, embedded RFC 2822 comments are allowed
        within the parsed fragment.
        """
        if self.field[self.pos] != beginchar:
            return ''

        slist = ['']
        quote = 0
        self.pos = self.pos + 1
        while self.pos < len(self.field):
            if quote == 1:
                slist.append(self.field[self.pos])
                quote = 0
            elif self.field[self.pos] in endchars:
                self.pos = self.pos + 1
                break
            elif allowcomments and self.field[self.pos] == '(':
                slist.append(self.getcomment())
            elif self.field[self.pos] == '\\':
                quote = 1
            else:
                slist.append(self.field[self.pos])
            self.pos = self.pos + 1

        return ''.join(slist)

    def getquote(self):
        """Get a quote-delimited fragment from self's field."""
        return self.getdelimited('"', '"\r', 0)

    def getcomment(self):
        """Get a parenthesis-delimited fragment from self's field."""
        return self.getdelimited('(', ')\r', 1)

    def getdomainliteral(self):
        """Parse an RFC 2822 domain-literal."""
        return '[%s]' % self.getdelimited('[', ']\r', 0)

    def getatom(self, atomends=None):
        """Parse an RFC 2822 atom.

        Optional atomends specifies a different set of end token delimiters
        (the default is to use self.atomends).  This is used e.g. in
        getphraselist() since phrase endings must not include the `.' (which
        is legal in phrases)."""
        atomlist = ['']
        if atomends is None:
            atomends = self.atomends

        while self.pos < len(self.field):
            if self.field[self.pos] in atomends:
                break
            else: atomlist.append(self.field[self.pos])
            self.pos = self.pos + 1

        return ''.join(atomlist)

    def getphraselist(self):
        """Parse a sequence of RFC 2822 phrases.

        A phrase is a sequence of words, which are in turn either RFC 2822
        atoms or quoted-strings.  Phrases are canonicalized by squeezing all
        runs of continuous whitespace into one space.
        """
        plist = []

        while self.pos < len(self.field):
            if self.field[self.pos] in self.LWS:
                self.pos = self.pos + 1
            elif self.field[self.pos] == '"':
                plist.append(self.getquote())
            elif self.field[self.pos] == '(':
                self.commentlist.append(self.getcomment())
            elif self.field[self.pos] in self.phraseends:
                break
            else:
                plist.append(self.getatom(self.phraseends))

        return plist

class AddressList(AddrlistClass):
    """An AddressList encapsulates a list of parsed RFC 2822 addresses."""
    def __init__(self, field):
        AddrlistClass.__init__(self, field)
        if field:
            self.addresslist = self.getaddrlist()
        else:
            self.addresslist = []

    def __len__(self):
        return len(self.addresslist)

    def __str__(self):
        return ", ".join(map(dump_address_pair, self.addresslist))

    def __add__(self, other):
        # Set union
        newaddr = AddressList(None)
        newaddr.addresslist = self.addresslist[:]
        for x in other.addresslist:
            if not x in self.addresslist:
                newaddr.addresslist.append(x)
        return newaddr

    def __iadd__(self, other):
        # Set union, in-place
        for x in other.addresslist:
            if not x in self.addresslist:
                self.addresslist.append(x)
        return self

    def __sub__(self, other):
        # Set difference
        newaddr = AddressList(None)
        for x in self.addresslist:
            if not x in other.addresslist:
                newaddr.addresslist.append(x)
        return newaddr

    def __isub__(self, other):
        # Set difference, in-place
        for x in other.addresslist:
            if x in self.addresslist:
                self.addresslist.remove(x)
        return self

    def __getitem__(self, index):
        # Make indexing, slices, and 'in' work
        return self.addresslist[index]
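
# Illustrative sketch, not part of the module's API: the operators above treat
# addresslist as an ordered set of (name, address) pairs.  The helper names
# below are hypothetical, and plain lists stand in for AddressList objects.
def _example_union(left, right):
    # Keep every item of left, then append items of right not already in left
    return left + [x for x in right if x not in left]

def _example_difference(left, right):
    # Keep the items of left that do not appear in right, preserving order
    return [x for x in left if x not in right]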

def dump_address_pair(pair):
    """Dump a (name, address) pair in a canonicalized form."""
    if pair[0]:
        return '"' + pair[0] + '" <' + pair[1] + '>'
    else:
        return pair[1]

# Parse a date field

_monthnames = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul',
               'aug', 'sep', 'oct', 'nov', 'dec',
               'january', 'february', 'march', 'april', 'may', 'june', 'july',
               'august', 'september', 'october', 'november', 'december']
_daynames = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']

# The timezone table does not include the military time zones defined
# in RFC822, other than Z.  According to RFC1123, the description in
# RFC822 gets the signs wrong, so we can't rely on any such time
# zones.  RFC1123 recommends that numeric timezone indicators be used
# instead of timezone names.

_timezones = {'UT':0, 'UTC':0, 'GMT':0, 'Z':0,
              'AST': -400, 'ADT': -300,  # Atlantic (used in Canada)
              'EST': -500, 'EDT': -400,  # Eastern
              'CST': -600, 'CDT': -500,  # Central
              'MST': -700, 'MDT': -600,  # Mountain
              'PST': -800, 'PDT': -700   # Pacific
              }


def parsedate_tz(data):
    """Convert a date string to a time tuple.

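    The result is a 10-tuple whose last item is the timezone offset in
    seconds (positive east of UTC), or None if no zone was recognized.
    Military zone names other than Z are not supported.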
    """
    if not data:
        return None
    data = data.split()
    if not data:
        # Whitespace-only input leaves no fields to parse
        return None
    if data[0][-1] in (',', '.') or data[0].lower() in _daynames:
        # There's a dayname here. Skip it
        del data[0]
    if len(data) == 3: # RFC 850 date, deprecated
        stuff = data[0].split('-')
        if len(stuff) == 3:
            data = stuff + data[1:]
    if len(data) == 4:
        s = data[3]
        i = s.find('+')
        if i == -1:
            i = s.find('-')
        if i > 0:
            # Keep the sign with the zone so that int() parses it
            data[3:] = [s[:i], s[i:]]
        else:
            data.append('') # Dummy tz
    if len(data) < 5:
        return None
    data = data[:5]
    [dd, mm, yy, tm, tz] = data
    mm = mm.lower()
    if not mm in _monthnames:
        dd, mm = mm, dd.lower()
        if not mm in _monthnames:
            return None
    mm = _monthnames.index(mm)+1
    if mm > 12: mm = mm - 12
    if dd[-1] == ',':
        dd = dd[:-1]
    i = yy.find(':')
    if i > 0:
        yy, tm = tm, yy
    if yy[-1] == ',':
        yy = yy[:-1]
    if not yy[0].isdigit():
        yy, tz = tz, yy
    if tm[-1] == ',':
        tm = tm[:-1]
    tm = tm.split(':')
    if len(tm) == 2:
        [thh, tmm] = tm
        tss = '0'
    elif len(tm) == 3:
        [thh, tmm, tss] = tm
    else:
        return None
    try:
        yy = int(yy)
        dd = int(dd)
        thh = int(thh)
        tmm = int(tmm)
        tss = int(tss)
    except ValueError:
        return None
    tzoffset = None
    tz = tz.upper()
    if tz in _timezones:
        tzoffset = _timezones[tz]
    else:
        try:
            tzoffset = int(tz)
        except ValueError:
            pass
    # Convert a timezone offset into seconds ; -0500 -> -18000
    if tzoffset:
        if tzoffset < 0:
            tzsign = -1
            tzoffset = -tzoffset
        else:
            tzsign = 1
        tzoffset = tzsign * ( (tzoffset//100)*3600 + (tzoffset % 100)*60)
    tuple = (yy, mm, dd, thh, tmm, tss, 0, 0, 0, tzoffset)
    return tuple
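
# Illustrative sketch, not part of the module's API: a numeric zone is written
# as +/-HHMM, so its hours and minutes are decimal digit groups rather than a
# plain count of minutes.  The helper name below is hypothetical.
def _example_offset_to_seconds(hhmm):
    # Split off the sign first so that // and % act on a non-negative value
    if hhmm < 0:
        sign = -1
    else:
        sign = 1
    hhmm = abs(hhmm)
    return sign * ((hhmm // 100) * 3600 + (hhmm % 100) * 60)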


def parsedate(data):
    """Convert a time string to a time tuple."""
    t = parsedate_tz(data)
    if isinstance(t, tuple):
        return t[:9]
    else:
        return t


def mktime_tz(data):
    """Turn a 10-tuple as returned by parsedate_tz() into a UTC timestamp."""
    if data[9] is None:
        # No zone info, so localtime is better assumption than GMT
        return time.mktime(data[:8] + (-1,))
    else:
        t = time.mktime(data[:8] + (0,))
        return t - data[9] - time.timezone
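
# Illustrative sketch, not mktime_tz() itself: the same UTC timestamp can be
# computed with calendar.timegm(), which reads the tuple as UTC and so does
# not depend on the local zone.  The helper name below is hypothetical, and a
# missing offset (None) is assumed to mean UTC here, unlike mktime_tz().
import calendar

def _example_utc_timestamp(data):
    offset = data[9] or 0
    # timegm() uses only the first six fields; subtracting the offset
    # moves the zone's wall-clock time back to UTC
    return calendar.timegm(data[:6] + (0, 0, 0)) - offset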

def formatdate(timeval=None):
    """Returns time format preferred for Internet standards.

    Sun, 06 Nov 1994 08:49:37 GMT  ; RFC 822, updated by RFC 1123

    According to RFC 1123, day and month names must always be in
    English.  If not for that, this code could use strftime().  It
    can't because strftime() honors the locale and could generate
    non-English names.
    """
    if timeval is None:
        timeval = time.time()
    timeval = time.gmtime(timeval)
    return "%s, %02d %s %04d %02d:%02d:%02d GMT" % (
            ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"][timeval[6]],
            timeval[2],
            ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"][timeval[1]-1],
                                timeval[0], timeval[3], timeval[4], timeval[5])


# When used as script, run a small test program.
# The first command line argument must be a filename containing one
# message in RFC-822 format.

if __name__ == '__main__':
    import sys, os
    file = os.path.join(os.environ['HOME'], 'Mail/inbox/1')
    if sys.argv[1:]: file = sys.argv[1]
    f = open(file, 'r')
    m = Message(f)
    print 'From:', m.getaddr('from')
    print 'To:', m.getaddrlist('to')
    print 'Subject:', m.getheader('subject')
    print 'Date:', m.getheader('date')
    date = m.getdate_tz('date')
    if date:
        tz = date[-1] or 0  # no zone info: show +0000
        date = time.localtime(mktime_tz(date))
        print 'ParsedDate:', time.asctime(date),
        hhmmss = tz
        hhmm, ss = divmod(hhmmss, 60)
        hh, mm = divmod(hhmm, 60)
        print "%+03d%02d" % (hh, mm),
        if ss: print ".%02d" % ss,
        print
    else:
        print 'ParsedDate:', None
    m.rewindbody()
    n = 0
    while f.readline():
        n = n + 1
    print 'Lines:', n
    print '-'*70
    print 'len =', len(m)
    if 'Date' in m: print 'Date =', m['Date']
    if 'X-Nonsense' in m: pass
    print 'keys =', m.keys()
    print 'values =', m.values()
    print 'items =', m.items()