cpython / Commits / 5db5c066

Commit 5db5c066
Authored by Christopher Beacham on May 16, 2018
Committed by Ned Deily on May 16, 2018
bpo-21475: Support the Sitemap extension in robotparser (GH-6883)

Parent: 7a1c0275
Showing 5 changed files with 47 additions and 0 deletions.
Doc/library/urllib.robotparser.rst    +9 -0
Lib/test/test_robotparser.py          +21 -0
Lib/urllib/robotparser.py             +12 -0
Misc/ACKS                             +2 -0
Misc/NEWS.d/next/Library/2018-05-15-15-03-48.bpo-28612.E9dz39.rst    +3 -0
Doc/library/urllib.robotparser.rst (view file @ 5db5c066)

@@ -76,6 +76,15 @@ structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.

      .. versionadded:: 3.6

   .. method:: site_maps()

      Returns the contents of the ``Sitemap`` parameter from
      ``robots.txt`` in the form of a :func:`list`. If there is no such
      parameter or the ``robots.txt`` entry for this parameter has
      invalid syntax, return ``None``.

      .. versionadded:: 3.8

The following example demonstrates basic use of the :class:`RobotFileParser`
class::

   ...
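
To make the documented behaviour concrete, here is a minimal sketch of the
new accessor (the robots.txt content and URLs are invented for illustration,
and it needs an interpreter with this change applied, i.e. Python 3.8+):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    # parse() accepts an iterable of lines, so no network fetch is needed;
    # the parser strips each line, so the indentation here is harmless.
    rp.parse("""\
    User-agent: *
    Sitemap: http://www.example.com/sitemap.xml
    Disallow: /private/
    """.splitlines())

    print(rp.site_maps())  # ['http://www.example.com/sitemap.xml']

    # With no Sitemap lines, site_maps() returns None rather than [].
    rp_empty = urllib.robotparser.RobotFileParser()
    rp_empty.parse(["User-agent: *", "Disallow:"])
    print(rp_empty.site_maps())  # None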
Lib/test/test_robotparser.py (view file @ 5db5c066)

@@ -12,6 +12,7 @@ class BaseRobotTest:
    agent = 'test_robotparser'
    good = []
    bad = []
    site_maps = None

    def setUp(self):
        lines = io.StringIO(self.robots_txt).readlines()

@@ -36,6 +37,9 @@ class BaseRobotTest:
            with self.subTest(url=url, agent=agent):
                self.assertFalse(self.parser.can_fetch(agent, url))

    def test_site_maps(self):
        self.assertEqual(self.parser.site_maps(), self.site_maps)


class UserAgentWildcardTest(BaseRobotTest, unittest.TestCase):
    robots_txt = """\

@@ -65,6 +69,23 @@ Disallow:
    bad = ['/cyberworld/map/index.html']


class SitemapTest(BaseRobotTest, unittest.TestCase):
    robots_txt = """\
# robots.txt for http://www.example.com/

User-agent: *
Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
Request-rate: 3/15
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
    """
    good = ['/', '/test.html']
    bad = ['/cyberworld/map/index.html']
    site_maps = ['http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml',
                 'http://www.google.com/hostednews/sitemap_index.xml']


class RejectAllRobotsTest(BaseRobotTest, unittest.TestCase):
    robots_txt = """\
# go away
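
To run just the new test class, a small sketch (assuming an interpreter that
contains this change and ships the stdlib test package):

    import unittest

    # Load only the SitemapTest class added in this commit.
    suite = unittest.defaultTestLoader.loadTestsFromName(
        "test.test_robotparser.SitemapTest")
    unittest.TextTestRunner(verbosity=2).run(suite)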
Lib/urllib/robotparser.py (view file @ 5db5c066)

@@ -27,6 +27,7 @@ class RobotFileParser:

    def __init__(self, url=''):
        self.entries = []
        self.sitemaps = []
        self.default_entry = None
        self.disallow_all = False
        self.allow_all = False

@@ -141,6 +142,12 @@ class RobotFileParser:
                            and numbers[1].strip().isdigit()):
                            entry.req_rate = RequestRate(int(numbers[0]), int(numbers[1]))
                        state = 2
                elif line[0] == "sitemap":
                    # According to http://www.sitemaps.org/protocol.html
                    # "This directive is independent of the user-agent line,
                    #  so it doesn't matter where you place it in your file."
                    #  Therefore we do not change the state of the parser.
                    self.sitemaps.append(line[1])
        if state == 2:
            self._add_entry(entry)

@@ -189,6 +196,11 @@ class RobotFileParser:
                return entry.req_rate
        return self.default_entry.req_rate

    def site_maps(self):
        if not self.sitemaps:
            return None
        return self.sitemaps

    def __str__(self):
        entries = self.entries
        if self.default_entry is not None:
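
The comment in the hunk above carries the key design point: per
http://www.sitemaps.org/protocol.html, Sitemap is independent of any
User-agent group, so the parser records it without touching its state
machine. A minimal sketch of the resulting behaviour (URLs invented for
illustration):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    # One Sitemap line before any User-agent group and one inside a group:
    # both are collected, in file order.
    rp.parse("""\
    Sitemap: http://www.example.com/a.xml

    User-agent: *
    Disallow: /tmp/
    Sitemap: http://www.example.com/b.xml
    """.splitlines())

    print(rp.site_maps())
    # ['http://www.example.com/a.xml', 'http://www.example.com/b.xml']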
Misc/ACKS (view file @ 5db5c066)

@@ -109,6 +109,7 @@ Anthony Baxter
Mike Bayer
Samuel L. Bayer
Bo Bayles
Christopher Beacham AKA Lady Red
Tommy Beadle
Donald Beaudry
David Beazley

@@ -1760,6 +1761,7 @@ Dik Winter
Blake Winton
Jean-Claude Wippler
Stéphane Wirtel
Peter Wirtz
Lars Wirzenius
John Wiseman
Chris Withers
Misc/NEWS.d/next/Library/2018-05-15-15-03-48.bpo-28612.E9dz39.rst (new file, mode 100644, view file @ 5db5c066)

Added support for Site Maps to urllib's ``RobotFileParser`` as
:meth:`RobotFileParser.site_maps() <urllib.robotparser.RobotFileParser.site_maps>`.
Patch by Lady Red, based on patch by Peter Wirtz.