\section{\module{re} ---
         Regular expression operations}
\declaremodule{standard}{re}
\moduleauthor{Andrew M. Kuchling}{amk1@bigfoot.com}
\moduleauthor{Fredrik Lundh}{effbot@telia.com}
\sectionauthor{Andrew M. Kuchling}{amk1@bigfoot.com}


\modulesynopsis{Regular expression search and match operations with a
                Perl-style expression syntax.}


This module provides regular expression matching operations similar to
those found in Perl.  Regular expression pattern strings may not
contain null bytes, but can specify the null byte using the
\code{\e\var{number}} notation.  Both patterns and strings to be
searched can be Unicode strings as well as 8-bit strings.  The
\module{re} module is always available.

Regular expressions use the backslash character (\character{\e}) to
indicate special forms or to allow special characters to be used
without invoking their special meaning.  This collides with Python's
usage of the same character for the same purpose in string literals;
for example, to match a literal backslash, one might have to write
\code{'\e\e\e\e'} as the pattern string, because the regular expression
must be \samp{\e\e}, and each backslash must be expressed as
\samp{\e\e} inside a regular Python string literal. 

The solution is to use Python's raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with \character{r}.  So \code{r"\e n"} is a
two-character string containing \character{\e} and \character{n},
while \code{"\e n"} is a one-character string containing a newline.
Usually patterns will be expressed in Python code using this raw
string notation.
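
As an illustration of the point above, the following sketch (assuming a
current Python interpreter; the variable names are only illustrative)
writes the same one-backslash pattern both ways and checks the string
lengths mentioned in the text:

\begin{verbatim}
import re

# In an ordinary string literal, four backslashes are needed to
# build the two-character pattern that matches one backslash.
escaped = re.compile('\\\\')
# The raw-string spelling of the same pattern needs only two.
raw = re.compile(r'\\')

path = 'C:\\temp'                    # the 7-character string C:\temp
same_pattern = (escaped.pattern == raw.pattern)
found = raw.search(path).group(0)    # the single backslash in path
lengths = (len(r"\n"), len("\n"))    # (2, 1)
\end{verbatim}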

\strong{Implementation note:}
The \module{re}\refstmodindex{pre} module has two distinct
implementations: \module{sre} is the default implementation and
includes Unicode support, but may run into stack limitations for some
patterns.  Though this will be fixed for a future release of Python,
the older implementation (without Unicode support) is still available
as the \module{pre}\refstmodindex{pre} module.


\begin{seealso}
  \seetitle{Mastering Regular Expressions}{Book on regular expressions
            by Jeffrey Friedl, published by O'Reilly.  The Python
            material in this book dates from before the \refmodule{re}
            module, but it covers writing good regular expression
            patterns in great detail.}
\end{seealso}


\subsection{Regular Expression Syntax \label{re-syntax}}

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression (or if a given regular expression
matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular
expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression.  If a string \emph{p}
matches A and another string \emph{q} matches B, the string \emph{pq}
will match AB.  Thus, complex expressions can easily be constructed
from simpler primitive expressions like the ones described here.  For
details of the theory and implementation of regular expressions,
consult the Friedl book referenced above, or almost any textbook about
compiler construction.

A brief explanation of the format of regular expressions follows.  For
further information and a gentler presentation, consult the Regular
Expression HOWTO, accessible from \url{http://www.python.org/doc/howto/}.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like \character{A}, \character{a}, or \character{0},
are the simplest regular expressions; they simply match themselves.
You can concatenate ordinary characters, so \regexp{last} matches the
string \code{'last'}.  (In the rest of this section, we'll write REs in
\regexp{this special style}, usually without quotes, and strings to be
matched \code{'in single quotes'}.)

Some characters, like \character{|} or \character{(}, are special.  Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

The special characters are:

\begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}

\item[\character{.}] (Dot.)  In the default mode, this matches any
character except a newline.  If the \constant{DOTALL} flag has been
specified, this matches any character including a newline.

\item[\character{\^}] (Caret.)  Matches the start of the string, and in
\constant{MULTILINE} mode also matches immediately after each newline.

\item[\character{\$}] Matches the end of the string, and in
\constant{MULTILINE} mode also matches before a newline.
\regexp{foo} matches both 'foo' and 'foobar', while the regular
expression \regexp{foo\$} matches only 'foo'.

\item[\character{*}] Causes the resulting RE to
match 0 or more repetitions of the preceding RE, as many repetitions
as are possible.  \regexp{ab*} will
match 'a', 'ab', or 'a' followed by any number of 'b's.

\item[\character{+}] Causes the
resulting RE to match 1 or more repetitions of the preceding RE.
\regexp{ab+} will match 'a' followed by any non-zero number of 'b's; it
will not match just 'a'.

\item[\character{?}] Causes the resulting RE to
match 0 or 1 repetitions of the preceding RE.  \regexp{ab?} will
match either 'a' or 'ab'.

\item[\code{*?}, \code{+?}, \code{??}] The \character{*}, \character{+}, and
\character{?} qualifiers are all \dfn{greedy}; they match as much text as
possible.  Sometimes this behaviour isn't desired; if the RE
\regexp{<.*>} is matched against \code{'<H1>title</H1>'}, it will match the
entire string, and not just \code{'<H1>'}.
Adding \character{?} after the qualifier makes it perform the match in
\dfn{non-greedy} or \dfn{minimal} fashion; as \emph{few} characters as
possible will be matched.  Using \regexp{.*?} in the previous
expression will match only \code{'<H1>'}.

\item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
\var{m} to \var{n} repetitions of the preceding RE, attempting to
match as many repetitions as possible.  For example, \regexp{a\{3,5\}}
will match from 3 to 5 \character{a} characters.  Omitting \var{n}
specifies an infinite upper bound; you can't omit \var{m}.

\item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
match from \var{m} to \var{n} repetitions of the preceding RE,
attempting to match as \emph{few} repetitions as possible.  This is
the non-greedy version of the previous qualifier.  For example, on the
6-character string \code{'aaaaaa'}, \regexp{a\{3,5\}} will match 5
\character{a} characters, while \regexp{a\{3,5\}?} will only match 3
characters.

\item[\character{\e}] Either escapes special characters (permitting
you to match characters like \character{*}, \character{?}, and so
forth), or signals a special sequence; special sequences are discussed
below.

If you're not using a raw string to
express the pattern, remember that Python also uses the
backslash as an escape sequence in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and
subsequent character are included in the resulting string.  However,
if Python would recognize the resulting sequence, the backslash should
be doubled.  This is complicated and hard to understand, so
it's highly recommended that you use raw strings for all but the
simplest expressions.

\item[\code{[]}] Used to indicate a set of characters.  Characters can
be listed individually, or a range of characters can be indicated by
giving two characters and separating them by a \character{-}.  Special
characters are not active inside sets.  For example, \regexp{[akm\$]}
will match any of the characters \character{a}, \character{k},
\character{m}, or \character{\$}; \regexp{[a-z]}
will match any lowercase letter, and \code{[a-zA-Z0-9]} matches any
letter or digit.  Character classes such as \code{\e w} or \code{\e S}
(defined below) are also acceptable inside a set.  If you want to
include a \character{]} or a \character{-} inside a set, precede it with a
backslash, or place it as the first character.  The 
pattern \regexp{[]]} will match \code{']'}, for example.  

You can match the characters not within a range by \dfn{complementing}
the set.  This is indicated by including a
\character{\^} as the first character of the set; \character{\^} elsewhere will
simply match the \character{\^} character.  For example, \regexp{[{\^}5]}
will match any character except \character{5}.

\item[\character{|}]\code{A|B}, where A and B can be arbitrary REs,
creates a regular expression that will match either A or B.  An
arbitrary number of REs can be separated by the \character{|} in this
way.  This can be used inside groups (see below) as well.  REs
separated by \character{|} are tried from left to right, and the first
one that allows the complete pattern to match is considered the
accepted branch.  This means that if \code{A} matches, \code{B} will
never be tested, even if it would produce a longer overall match.  In
other words, the \character{|} operator is never greedy.  To match a
literal \character{|}, use \regexp{\e|}, or enclose it inside a
character class, as in \regexp{[|]}.

\item[\code{(...)}] Matches whatever regular expression is inside the
parentheses, and indicates the start and end of a group; the contents
of a group can be retrieved after a match has been performed, and can
be matched later in the string with the \regexp{\e \var{number}} special
sequence, described below.  To match the literals \character{(} or
\character{)}, use \regexp{\e(} or \regexp{\e)}, or enclose them
inside a character class: \regexp{[(] [)]}.

\item[\code{(?...)}] This is an extension notation (a \character{?}
following a \character{(} is not meaningful otherwise).  The first
character after the \character{?} 
determines what the meaning and further syntax of the construct is.
Extensions usually do not create a new group;
\regexp{(?P<\var{name}>...)} is the only exception to this rule.
Following are the currently supported extensions.

\item[\code{(?iLmsux)}] (One or more letters from the set \character{i},
\character{L}, \character{m}, \character{s}, \character{u},
\character{x}.)  The group matches the empty string; the letters set
the corresponding flags (\constant{re.I}, \constant{re.L},
\constant{re.M}, \constant{re.S}, \constant{re.U}, \constant{re.X})
for the entire regular expression.  This is useful if you wish to
include the flags as part of the regular expression, instead of
passing a \var{flag} argument to the \function{compile()} function.

Note that the \regexp{(?x)} flag changes how the expression is parsed.
It should be used first in the expression string, or after one or more
whitespace characters.  If there are non-whitespace characters before
the flag, the results are undefined.

\item[\code{(?:...)}] A non-grouping version of regular parentheses.
Matches whatever regular expression is inside the parentheses, but the
substring matched by the
group \emph{cannot} be retrieved after performing a match or
referenced later in the pattern.

\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
the substring matched by the group is accessible via the symbolic group
name \var{name}.  Group names must be valid Python identifiers.  A
symbolic group is also a numbered group, just as if the group were not
named.  So the group named 'id' in the example below can also be
referenced as the numbered group 1.

For example, if the pattern is
\regexp{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
name in arguments to methods of match objects, such as \code{m.group('id')}
or \code{m.end('id')}, and also by name in pattern text
(e.g. \regexp{(?P=id)}) and replacement text (e.g. \code{\e g<id>}).

\item[\code{(?P=\var{name})}] Matches whatever text was matched by the
earlier group named \var{name}.

\item[\code{(?\#...)}] A comment; the contents of the parentheses are
simply ignored.

\item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't
consume any of the string.  This is called a lookahead assertion.  For
example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's
followed by \code{'Asimov'}.

\item[\code{(?!...)}] Matches if \regexp{...} doesn't match next.  This
is a negative lookahead assertion.  For example,
\regexp{Isaac (?!Asimov)} will match \code{'Isaac~'} only if it's \emph{not}
followed by \code{'Asimov'}.

\item[\code{(?<=...)}] Matches if the current position in the string
is preceded by a match for \regexp{...} that ends at the current
position.  This is called a positive lookbehind assertion.
\regexp{(?<=abc)def} will find a match in \samp{abcdef}, since the lookbehind
will back up 3 characters and check if the contained pattern matches.
The contained pattern must only match strings of some fixed length,
meaning that \regexp{abc} or \regexp{a|b} are allowed, but \regexp{a*}
isn't.

\item[\code{(?<!...)}] Matches if the current position in the string
is not preceded by a match for \regexp{...}.  This
is called a negative lookbehind assertion.  Similar to positive lookbehind
assertions, the contained pattern must only match strings of some
fixed length.

\end{list}
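
Several of the special characters described above can be tried out
directly.  The following is a minimal sketch, assuming a current Python
interpreter; it reuses the example patterns from the list, and the
variable names are only illustrative:

\begin{verbatim}
import re

text = '<H1>title</H1>'
greedy = re.match(r'<.*>', text).group(0)     # '<H1>title</H1>'
minimal = re.match(r'<.*?>', text).group(0)   # '<H1>'

most = re.match(r'a{3,5}', 'aaaaaa').group(0)     # 'aaaaa'
fewest = re.match(r'a{3,5}?', 'aaaaaa').group(0)  # 'aaa'

# Alternation accepts the first branch that lets the match succeed.
branch = re.match(r'a|ab', 'ab').group(0)         # 'a'

# Lookahead and lookbehind assertions consume no text.
ahead = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov').group(0)
not_ahead = re.match(r'Isaac (?!Asimov)', 'Isaac Asimov')   # None
behind = re.search(r'(?<=abc)def', 'abcdef').group(0)       # 'def'

# A named group is also reachable by its number.
m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 2')
named = (m.group('id'), m.group(1), m.end('id'))
\end{verbatim}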

The special sequences consist of \character{\e} and a character from the
list below.  If the ordinary character is not on the list, then the
resulting RE will match the second character.  For example,
\regexp{\e\$} matches the character \character{\$}.

\begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}

\item[\code{\e \var{number}}] Matches the contents of the group of the
same number.  Groups are numbered starting from 1.  For example,
\regexp{(.+) \e 1} matches \code{'the the'} or \code{'55 55'}, but not
\code{'the end'} (note
the space after the group).  This special sequence can only be used to
match one of the first 99 groups.  If the first digit of \var{number}
is 0, or \var{number} is 3 octal digits long, it will not be interpreted
as a group match, but as the character with octal value \var{number}.
Inside the \character{[} and \character{]} of a character class, all numeric
escapes are treated as characters.

\item[\code{\e A}] Matches only at the start of the string.

\item[\code{\e b}] Matches the empty string, but only at the
beginning or end of a word.  A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character.  Inside a character range,
\regexp{\e b} represents the backspace character, for compatibility with
Python's string literals.

\item[\code{\e B}] Matches the empty string, but only when it is
\emph{not} at the beginning or end of a word.

\item[\code{\e d}] Matches any decimal digit; this is
equivalent to the set \regexp{[0-9]}.

\item[\code{\e D}] Matches any non-digit character; this is
equivalent to the set \regexp{[{\^}0-9]}.

\item[\code{\e s}] Matches any whitespace character; this is
equivalent to the set \regexp{[ \e t\e n\e r\e f\e v]}.

\item[\code{\e S}] Matches any non-whitespace character; this is
equivalent to the set \regexp{[\^\ \e t\e n\e r\e f\e v]}.

\item[\code{\e w}]When the \constant{LOCALE} and \constant{UNICODE}
flags are not specified,
matches any alphanumeric character; this is equivalent to the set
\regexp{[a-zA-Z0-9_]}.  With \constant{LOCALE}, it will match the set
\regexp{[0-9_]} plus whatever characters are defined as letters for
the current locale.  If \constant{UNICODE} is set, this will match the
characters \regexp{[0-9_]} plus whatever is classified as alphanumeric
in the Unicode character properties database.

\item[\code{\e W}]When the \constant{LOCALE} and \constant{UNICODE}
flags are not specified, matches any non-alphanumeric character; this
is equivalent to the set \regexp{[{\^}a-zA-Z0-9_]}.   With
\constant{LOCALE}, it will match any character not in the set
\regexp{[0-9_]}, and not defined as a letter for the current locale.
If \constant{UNICODE} is set, this will match anything other than
\regexp{[0-9_]} and characters marked as alphanumeric in the Unicode
character properties database.

\item[\code{\e Z}]Matches only at the end of the string.

\item[\code{\e \e}] Matches a literal backslash.

\end{list}
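
The special sequences can be exercised in the same way.  This sketch
assumes a current Python interpreter and uses only ASCII input, so the
results do not depend on the \constant{LOCALE} or \constant{UNICODE}
settings; the sample strings are made up for illustration:

\begin{verbatim}
import re

# \1 must repeat exactly the text that group 1 matched.
doubled = re.match(r'(.+) \1', 'the the') is not None     # True
different = re.match(r'(.+) \1', 'the end') is not None   # False

# \b matches only at the beginning or end of a word.
whole_word = re.search(r'\bcat\b', 'a cat sat') is not None     # True
inside_word = re.search(r'\bcat\b', 'concatenate') is not None  # False

digits = re.findall(r'\d+', 'room 101, floor 3')   # ['101', '3']
words = re.findall(r'\w+', 'foo_bar baz!')         # ['foo_bar', 'baz']
\end{verbatim}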


\subsection{Matching vs. Searching \label{matching-searching}}
\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}

Python offers two different primitive operations based on regular
expressions: match and search.  If you are accustomed to Perl's
semantics, the search operation is what you're looking for.  See the
\function{search()} function and corresponding method of compiled
regular expression objects.

Note that match may differ from search using a regular expression
beginning with \character{\^}: \character{\^} matches only at the
start of the string, or in \constant{MULTILINE} mode also immediately
following a newline.  The ``match'' operation succeeds only if the
pattern matches at the start of the string regardless of mode, or at
the starting position given by the optional \var{pos} argument
regardless of whether a newline precedes it.

% Examples from Tim Peters:
\begin{verbatim}
re.compile("a").match("ba", 1)           # succeeds
re.compile("^a").search("ba", 1)         # fails; 'a' not at start
re.compile("^a").search("\na", 1)        # fails; 'a' not at start
re.compile("^a", re.M).search("\na", 1)  # succeeds
re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n
\end{verbatim}


\subsection{Module Contents}
\nodename{Contents of Module re}

The module defines the following functions and constants, and an exception:


\begin{funcdesc}{compile}{pattern\optional{, flags}}
  Compile a regular expression pattern into a regular expression
  object, which can be used for matching using its \function{match()} and
  \function{search()} methods, described below.  

  The expression's behaviour can be modified by specifying a
  \var{flags} value.  Values can be any of the following variables,
  combined using bitwise OR (the \code{|} operator).

The sequence

\begin{verbatim}
prog = re.compile(pat)
result = prog.match(str)
\end{verbatim}

is equivalent to

\begin{verbatim}
result = re.match(pat, str)
\end{verbatim}

but the version using \function{compile()} is more efficient when the
expression will be used several times in a single program.
%(The compiled version of the last pattern passed to
%\function{re.match()} or \function{re.search()} is cached, so
%programs that use only a single regular expression at a time needn't
%worry about compiling regular expressions.)
\end{funcdesc}

\begin{datadesc}{I}
\dataline{IGNORECASE}
Perform case-insensitive matching; expressions like \regexp{[A-Z]} will match
lowercase letters, too.  This is not affected by the current locale.
\end{datadesc}

\begin{datadesc}{L}
\dataline{LOCALE}
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
\regexp{\e B} dependent on the current locale. 
\end{datadesc}

\begin{datadesc}{M}
\dataline{MULTILINE}
When specified, the pattern character \character{\^} matches at the
beginning of the string and at the beginning of each line
(immediately following each newline); and the pattern character
\character{\$} matches at the end of the string and at the end of each line
(immediately preceding each newline).
By default, \character{\^} matches only at the beginning of the string, and
\character{\$} only at the end of the string and immediately before the
newline (if any) at the end of the string.
\end{datadesc}

\begin{datadesc}{S}
\dataline{DOTALL}
Make the \character{.} special character match any character at all,
including a newline; without this flag, \character{.} will match
anything \emph{except} a newline.
\end{datadesc}

\begin{datadesc}{U}
\dataline{UNICODE}
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
\regexp{\e B} dependent on the Unicode character properties database.
\versionadded{2.0}
\end{datadesc}

\begin{datadesc}{X}
\dataline{VERBOSE}
This flag allows you to write regular expressions that look nicer.
Whitespace within the pattern is ignored,
except when in a character class or preceded by an unescaped
backslash, and, when a line contains a \character{\#} neither in a character
class nor preceded by an unescaped backslash, all characters from the
leftmost such \character{\#} through the end of the line are ignored.
% XXX should add an example here
\end{datadesc}
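
As a rough illustration of \constant{VERBOSE}, and of combining flags
with the \code{|} operator, the sketch below (assuming a current Python
interpreter; the patterns are made-up examples) writes one pattern in
both verbose and compact form:

\begin{verbatim}
import re

# Whitespace and #-comments inside the pattern are ignored.
number = re.compile(r"""
    \d+        # the integral part
    \.         # the decimal point
    \d*        # optional fractional digits
    """, re.VERBOSE)
compact = re.compile(r"\d+\.\d*")

verbose_hit = number.match('3.14').group(0)
compact_hit = compact.match('3.14').group(0)

# Flags are combined with bitwise OR.
either_case = re.compile(r"[a-z]+   # letters", re.I | re.X)
word = either_case.match('Spam').group(0)
\end{verbatim}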


\begin{funcdesc}{search}{pattern, string\optional{, flags}}
  Scan through \var{string} looking for a location where the regular
  expression \var{pattern} produces a match, and return a
  corresponding \class{MatchObject} instance.
  Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.
\end{funcdesc}

\begin{funcdesc}{match}{pattern, string\optional{, flags}}
  If zero or more characters at the beginning of \var{string} match
  the regular expression \var{pattern}, return a corresponding
  \class{MatchObject} instance.  Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.

  \strong{Note:}  If you want to locate a match anywhere in
  \var{string}, use \method{search()} instead.
\end{funcdesc}
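
The difference between the two functions, and between a failed match
and a zero-length match, can be seen in a short sketch (assuming a
current Python interpreter; the strings are arbitrary):

\begin{verbatim}
import re

at_start = re.match('c', 'abcdef')     # None: 'c' is not at the start
anywhere = re.search('c', 'abcdef')    # a match object; start() is 2
zero_len = re.match('x*', 'abcdef')    # a match object for ''

results = (at_start is None, anywhere.start(), zero_len.group(0))
\end{verbatim}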

\begin{funcdesc}{split}{pattern, string\optional{, maxsplit\code{ = 0}}}
  Split \var{string} by the occurrences of \var{pattern}.  If
  capturing parentheses are used in \var{pattern}, then the text of all
  groups in the pattern is also returned as part of the resulting list.
  If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
  occur, and the remainder of the string is returned as the final
  element of the list.  (Incompatibility note: in the original Python
  1.5 release, \var{maxsplit} was ignored.  This has been fixed in
  later releases.)

\begin{verbatim}
>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split('(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split('\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
\end{verbatim}

  This function combines and extends the functionality of
  the old \function{regsub.split()} and \function{regsub.splitx()}.
\end{funcdesc}

\begin{funcdesc}{findall}{pattern, string}
Return a list of all non-overlapping matches of \var{pattern} in
\var{string}.  If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group.  Empty matches are included in the result.
\versionadded{1.5.2}
\end{funcdesc}
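
How the shape of the result depends on the number of groups can be
sketched as follows (assuming a current Python interpreter; the sample
text is invented for the example):

\begin{verbatim}
import re

text = 'width=80, height=24'
no_group = re.findall(r'\d+', text)            # whole matches
one_group = re.findall(r'(\w+)=\d+', text)     # contents of the group
two_groups = re.findall(r'(\w+)=(\d+)', text)  # one tuple per match
\end{verbatim}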

\begin{funcdesc}{sub}{pattern, repl, string\optional{, count\code{ = 0}}}
Return the string obtained by replacing the leftmost non-overlapping
occurrences of \var{pattern} in \var{string} by the replacement
\var{repl}.  If the pattern isn't found, \var{string} is returned
unchanged.  \var{repl} can be a string or a function; if a function,
it is called for every non-overlapping occurrence of \var{pattern}.
The function takes a single match object argument, and returns the
replacement string.  For example:

\begin{verbatim}
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
\end{verbatim}

The pattern may be a string or an RE object; if you need to specify
regular expression flags, you must use an RE object, or use
embedded modifiers in a pattern; e.g.
\samp{sub("(?i)b+", "x", "bbbb BBBB")} returns \code{'x x'}.

The optional argument \var{count} is the maximum number of pattern
occurrences to be replaced; \var{count} must be a non-negative integer, and
the default value of 0 means to replace all occurrences.

Empty matches for the pattern are replaced only when not adjacent to a
previous match, so \samp{sub('x*', '-', 'abc')} returns \code{'-a-b-c-'}.

If \var{repl} is a string, any backslash escapes in it are processed.
That is, \samp{\e n} is converted to a single newline character,
\samp{\e r} is converted to a carriage return, and so forth.  Unknown escapes
such as \samp{\e j} are left alone.  Backreferences, such as \samp{\e 6}, are
replaced with the substring matched by group 6 in the pattern. 

In addition to character escapes and backreferences as described
above, \samp{\e g<name>} will use the substring matched by the group
named \samp{name}, as defined by the \regexp{(?P<name>...)} syntax.
\samp{\e g<number>} uses the corresponding group number; \samp{\e
g<2>} is therefore equivalent to \samp{\e 2}, but isn't ambiguous in a
replacement such as \samp{\e g<2>0}.  \samp{\e 20} would be
interpreted as a reference to group 20, not a reference to group 2
followed by the literal character \character{0}.  
\end{funcdesc}
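
The replacement-string forms described for \function{sub()} can be
sketched as follows, assuming a current Python interpreter; the date
pattern and the sample strings are invented for the example:

\begin{verbatim}
import re

date = r'(?P<year>\d{4})-(?P<month>\d{2})'
by_name = re.sub(date, r'\g<month>/\g<year>', '2001-07')
by_number = re.sub(date, r'\2/\1', '2001-07')

# \g<1>0 is group 1 followed by a literal '0'; \10 would be group 10.
padded = re.sub(r'(\d)', r'\g<1>0', 'a1b2')

# count limits the number of replacements made.
first_only = re.sub(r'\d', '#', 'a1b2', count=1)
\end{verbatim}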

\begin{funcdesc}{subn}{pattern, repl, string\optional{, count\code{ = 0}}}
Perform the same operation as \function{sub()}, but return a tuple
\code{(\var{new_string}, \var{number_of_subs_made})}.
\end{funcdesc}

\begin{funcdesc}{escape}{string}
  Return \var{string} with all non-alphanumerics backslashed; this is
  useful if you want to match an arbitrary literal string that may have
  regular expression metacharacters in it.
\end{funcdesc}
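
A brief sketch of why \function{escape()} is useful, assuming a current
Python interpreter.  Exactly which characters receive a backslash may
differ between Python versions, so the example checks only the matching
behaviour, not the escaped text itself:

\begin{verbatim}
import re

literal = 'a.b*c'
pattern = re.escape(literal)

# The escaped pattern matches the literal text only.
exact = re.search(pattern, 'xx a.b*c yy').group(0)
escaped_miss = re.search(pattern, 'aXbbbc') is None

# Unescaped, '.' and '*' keep their special meaning.
unescaped_hit = re.search(literal, 'aXbbbc') is not None
\end{verbatim}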

\begin{excdesc}{error}
  Exception raised when a string passed to one of the functions here
  is not a valid regular expression (e.g., unmatched parentheses) or
  when some other error occurs during compilation or matching.  It is
  never an error if a string contains no match for a pattern.
\end{excdesc}


\subsection{Regular Expression Objects \label{re-objects}}

Compiled regular expression objects support the following methods and
attributes:

\begin{methoddesc}[RegexObject]{search}{string\optional{, pos\optional{,
                                        endpos}}}
  Scan through \var{string} looking for a location where this regular
  expression produces a match, and return a
  corresponding \class{MatchObject} instance.  Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.
  
  The optional \var{pos} and \var{endpos} parameters have the same
  meaning as for the \method{match()} method.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{match}{string\optional{, pos\optional{,
                                       endpos}}}
  If zero or more characters at the beginning of \var{string} match
  this regular expression, return a corresponding
  \class{MatchObject} instance.  Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.

  \strong{Note:}  If you want to locate a match anywhere in
  \var{string}, use \method{search()} instead.

  The optional second parameter \var{pos} gives an index in the string
  where the search is to start; it defaults to \code{0}.  This is not
  completely equivalent to slicing the string; the \code{'\^'} pattern
  character matches at the real beginning of the string and at positions
  just after a newline, but not necessarily at the index where the search
  is to start.

  The optional parameter \var{endpos} limits how far the string will
  be searched; it will be as if the string is \var{endpos} characters
  long, so only the characters from \var{pos} to \var{endpos} will be
  searched for a match.
\end{methoddesc}
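
The effect of \var{pos} and \var{endpos}, and the way \var{pos} differs
from slicing, can be sketched as follows (assuming a current Python
interpreter; the patterns and strings are arbitrary):

\begin{verbatim}
import re

anchored = re.compile('^a')
sliced = anchored.search('ba'[1:])   # matches: the slice begins with 'a'
with_pos = anchored.search('ba', 1)  # None: '^' means the real start

letters = re.compile('[a-z]+')
# Only the characters from index 1 up to (not including) 4 are seen.
limited = letters.match('abcdef', 1, 4).group(0)
\end{verbatim}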

\begin{methoddesc}[RegexObject]{split}{string\optional{,
Fred Drake's avatar
Fred Drake committed
606
                                       maxsplit\code{ = 0}}}
Fred Drake's avatar
Fred Drake committed
607
Identical to the \function{split()} function, using the compiled pattern.
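
For instance, the following sketch (with a made-up separator pattern)
checks that the method and the module-level function agree:

\begin{verbatim}
import re

sep = r"\s*,\s*"
p = re.compile(sep)
text = "a, b ,c"

assert p.split(text) == ["a", "b", "c"]
assert p.split(text) == re.split(sep, text)
# maxsplit limits the number of splits; the rest is returned unsplit.
assert p.split(text, maxsplit=1) == ["a", "b ,c"]
\end{verbatim}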
\end{methoddesc}

\begin{methoddesc}[RegexObject]{findall}{string}
Identical to the \function{findall()} function, using the compiled pattern.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{sub}{repl, string\optional{, count\code{ = 0}}}
Identical to the \function{sub()} function, using the compiled pattern.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{subn}{repl, string\optional{,
                                      count\code{ = 0}}}
Identical to the \function{subn()} function, using the compiled pattern.
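
A short sketch of \method{subn()} next to \method{sub()}; the pattern,
replacement and input string are assumptions made up for the example:

\begin{verbatim}
import re

p = re.compile(r"\d+")
text = "a1b22c333"

assert p.sub("N", text) == "aNbNcN"
# count limits how many occurrences are replaced.
assert p.sub("N", text, count=2) == "aNbNc333"
# subn() also reports how many substitutions were made.
assert p.subn("N", text) == ("aNbNcN", 3)
\end{verbatim}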
\end{methoddesc}


\begin{memberdesc}[RegexObject]{flags}
The flags argument used when the RE object was compiled, or
\code{0} if no flags were provided.
\end{memberdesc}

\begin{memberdesc}[RegexObject]{groupindex}
A dictionary mapping any symbolic group names defined by
\regexp{(?P<\var{id}>)} to group numbers.  The dictionary is empty if no
symbolic groups were used in the pattern.
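
As a small sketch, using invented group names \code{int} and
\code{frac}:

\begin{verbatim}
import re

p = re.compile(r"(?P<int>\d+)\.(?P<frac>\d*)")
assert p.groupindex["int"] == 1
assert p.groupindex["frac"] == 2

# A pattern without named groups has an empty mapping.
assert len(re.compile(r"(\d+)").groupindex) == 0
\end{verbatim}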
\end{memberdesc}

\begin{memberdesc}[RegexObject]{pattern}
The pattern string from which the RE object was compiled.
\end{memberdesc}


\subsection{Match Objects \label{match-objects}}

\class{MatchObject} instances support the following methods and attributes:

\begin{methoddesc}[MatchObject]{expand}{template}
 Return the string obtained by doing backslash substitution on the
template string \var{template}, as done by the \method{sub()} method.
Escapes such as \samp{\e n} are converted to the appropriate
characters, and numeric backreferences (\samp{\e 1}, \samp{\e 2}) and named
backreferences (\samp{\e g<1>}, \samp{\e g<name>}) are replaced by the contents of the
corresponding group.
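
A minimal sketch of template expansion; the group names \code{first}
and \code{last} and the sample string are made up for illustration:

\begin{verbatim}
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# Numeric and named backreferences give the same result here.
assert m.expand(r"\2, \1") == "Newton, Isaac"
assert m.expand(r"\g<last>, \g<first>") == "Newton, Isaac"
# The two-character escape \n in the template becomes a newline.
assert m.expand(r"\1\n") == "Isaac\n"
\end{verbatim}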
\end{methoddesc}

\begin{methoddesc}[MatchObject]{group}{\optional{group1, \moreargs}}
Return one or more subgroups of the match.  If there is a single
argument, the result is a single string; if there are
multiple arguments, the result is a tuple with one item per argument.
Without arguments, \var{group1} defaults to zero (i.e. the whole match
is returned).
If a \var{groupN} argument is zero, the corresponding return value is the
entire matching string; if it is in the inclusive range [1..99], it is
the string matching the corresponding parenthesized group.  If a
group number is negative or larger than the number of groups defined
in the pattern, an \exception{IndexError} exception is raised.
If a group is contained in a part of the pattern that did not match,
the corresponding result is \code{None}.  If a group is contained in a
part of the pattern that matched multiple times, the last match is
returned.

If the regular expression uses the \regexp{(?P<\var{name}>...)} syntax,
the \var{groupN} arguments may also be strings identifying groups by
their group name.  If a string argument is not used as a group name in
the pattern, an \exception{IndexError} exception is raised.

A moderately complicated example:

\begin{verbatim}
m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
\end{verbatim}

After performing this match, \code{m.group(1)} is \code{'3'}, as is
\code{m.group('int')}, and \code{m.group(2)} is \code{'14'}.
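
The remaining cases (no arguments, several arguments, a group that did
not take part in the match, and a repeated group) can be sketched as
follows; the second and third patterns are assumptions invented for the
example:

\begin{verbatim}
import re

m = re.match(r"(?P<int>\d+)\.(\d*)", "3.14")
assert m.group() == "3.14"                 # same as m.group(0)
assert m.group(1, 2) == ("3", "14")        # several arguments: a tuple
assert m.group("int", 2) == ("3", "14")    # names and numbers can be mixed

m = re.match(r"(a)|(b)", "b")
assert m.group(1) is None                  # group 1 did not participate
assert m.group(2) == "b"

m = re.match(r"(\w)+", "abc")
assert m.group(1) == "c"                   # last repetition is kept
\end{verbatim}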
\end{methoddesc}

\begin{methoddesc}[MatchObject]{groups}{\optional{default}}
Return a tuple containing all the subgroups of the match, from 1 up to
however many groups are in the pattern.  The \var{default} argument is
used for groups that did not participate in the match; it defaults to
\code{None}.  (Incompatibility note: in the original Python 1.5
release, if the tuple was one element long, a string would be returned
instead.  In later versions (from 1.5.1 on), a singleton tuple is
returned in such cases.)
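
A brief sketch of the \var{default} argument, using an invented pattern
whose second group is optional:

\begin{verbatim}
import re

m = re.match(r"(\d+)\.?(\d+)?", "24")
# The second group did not participate in the match.
assert m.groups() == ("24", None)
assert m.groups("0") == ("24", "0")
\end{verbatim}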
\end{methoddesc}

\begin{methoddesc}[MatchObject]{groupdict}{\optional{default}}
Return a dictionary containing all the \emph{named} subgroups of the
match, keyed by the subgroup name.  The \var{default} argument is
used for groups that did not participate in the match; it defaults to
\code{None}.
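
For example, in the following sketch (the group names \code{first} and
\code{last} are assumptions made up for illustration) the optional
second group does not take part in the match:

\begin{verbatim}
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)?", "Isaac ")
assert m.groupdict() == {"first": "Isaac", "last": None}
assert m.groupdict("") == {"first": "Isaac", "last": ""}
\end{verbatim}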
\end{methoddesc}

\begin{methoddesc}[MatchObject]{start}{\optional{group}}
\funcline{end}{\optional{group}}
Return the indices of the start and end of the substring
matched by \var{group}; \var{group} defaults to zero (meaning the whole
matched substring).
Return \code{-1} if \var{group} exists but
did not contribute to the match.  For a match object
\var{m}, and a group \var{g} that did contribute to the match, the
substring matched by group \var{g} (equivalent to
\code{\var{m}.group(\var{g})}) is

\begin{verbatim}
m.string[m.start(g):m.end(g)]
\end{verbatim}

Note that
\code{m.start(\var{group})} will equal \code{m.end(\var{group})} if
\var{group} matched a null string.  For example, after \code{\var{m} =
re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1,
\code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and
\code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises
an \exception{IndexError} exception.
\end{methoddesc}

\begin{methoddesc}[MatchObject]{span}{\optional{group}}
For \class{MatchObject} \var{m}, return the 2-tuple
\code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}.
Note that if \var{group} did not contribute to the match, this is
\code{(-1, -1)}.  Again, \var{group} defaults to zero.
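
A small sketch with an invented alternation, in which only the second
group contributes to the match:

\begin{verbatim}
import re

m = re.search(r"(a)|(b)", "xxb")
assert m.span() == (2, 3)        # whole match
assert m.span(2) == (2, 3)       # group 2 matched 'b'
assert m.span(1) == (-1, -1)     # group 1 did not contribute
assert m.span(2) == (m.start(2), m.end(2))
\end{verbatim}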
\end{methoddesc}

\begin{memberdesc}[MatchObject]{pos}
The value of \var{pos} which was passed to the
\method{search()} or \method{match()} method of the
\class{RegexObject}.  This is the index
into the string at which the RE engine started looking for a match.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{endpos}
The value of \var{endpos} which was passed to the
\method{search()} or \method{match()} method of the
\class{RegexObject}.  This is the index
into the string beyond which the RE engine will not go.
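
As a sketch (the pattern, string and index values are assumptions
chosen for the example), both \member{pos} and \member{endpos} simply
record the limits that were passed in:

\begin{verbatim}
import re

p = re.compile(r"\d+")
s = "ab123456"

# Only s[3:6], i.e. the characters '2', '3' and '4', are examined.
m = p.search(s, 3, 6)
assert (m.pos, m.endpos) == (3, 6)
assert m.group() == "234"
\end{verbatim}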
\end{memberdesc}

\begin{memberdesc}[MatchObject]{lastgroup}
The name of the last matched capturing group, or \code{None} if the
group didn't have a name, or if no group was matched at all.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{lastindex}
The integer index of the last matched capturing group, or \code{None}
if no group was matched at all.
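
The following sketch shows \member{lastindex} together with
\member{lastgroup}; the pattern and the group name \code{word} are
made up for illustration:

\begin{verbatim}
import re

p = re.compile(r"(?P<word>[a-z]+)|(\d+)")

m = p.match("abc")
assert m.lastindex == 1 and m.lastgroup == "word"

m = p.match("42")
assert m.lastindex == 2
assert m.lastgroup is None       # group 2 has no name

# A pattern without groups leaves both attributes unset.
m = re.match(r"\d+", "42")
assert m.lastindex is None and m.lastgroup is None
\end{verbatim}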
\end{memberdesc}

\begin{memberdesc}[MatchObject]{re}
The regular expression object whose \method{match()} or
\method{search()} method produced this \class{MatchObject} instance.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{string}
The string passed to \function{match()} or \function{search()}.
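
A short sketch relating \member{string} and \member{re} back to the
objects that produced the match; the pattern and sample string are
assumptions invented for the example:

\begin{verbatim}
import re

p = re.compile(r"b+")
s = "abbbc"
m = p.search(s)

assert m.re is p                 # the compiled pattern that was used
assert m.string is s             # the string that was searched
assert m.string[m.start():m.end()] == m.group()
assert m.group() == "bbb"
\end{verbatim}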
\end{memberdesc}