
:mod:`robotparser` --- Parser for robots.txt
=============================================

.. module:: robotparser
   :synopsis: Loads a robots.txt file and answers questions about
              fetchability of other URLs.
.. sectionauthor:: Skip Montanaro <skip@pobox.com>


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

.. note::
   The :mod:`robotparser` module has been renamed :mod:`urllib.robotparser` in
   Python 3.0.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to 3.0.

This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether or not a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file.  For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.


.. class:: RobotFileParser()

   This class provides a set of methods to read, parse and answer questions
   about a single :file:`robots.txt` file.


   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.


   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.


   .. method:: parse(lines)

      Parses the *lines* argument, a list of lines from a :file:`robots.txt`
      file.
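
      For example, :meth:`parse` can be used to apply rules obtained without
      :meth:`read`, e.g. from a document fetched separately (the rules and
      URLs below are hypothetical)::

         >>> import robotparser
         >>> rp = robotparser.RobotFileParser()
         >>> rp.parse("User-agent: *\nDisallow: /private/\n".splitlines())
         >>> rp.can_fetch("*", "http://example.com/private/page.html")
         False
         >>> rp.can_fetch("*", "http://example.com/index.html")
         True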


   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.


   .. method:: mtime()

      Returns the time the :file:`robots.txt` file was last fetched.  This is
      useful for long-running web spiders that need to check for new
      :file:`robots.txt` files periodically.


   .. method:: modified()

      Sets the time the :file:`robots.txt` file was last fetched to the
      current time.
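
      :meth:`mtime` and :meth:`modified` can be combined to decide when the
      rules should be re-fetched; a minimal sketch (the 60-second threshold
      is an arbitrary choice)::

         >>> import time
         >>> import robotparser
         >>> rp = robotparser.RobotFileParser()
         >>> rp.mtime()                     # never fetched yet
         0
         >>> rp.modified()                  # record the fetch time
         >>> time.time() - rp.mtime() < 60  # recent enough, no re-read needed
         True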

The following example demonstrates basic use of the :class:`RobotFileParser`
class. ::

   >>> import robotparser
   >>> rp = robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True