re.py 15.1 KB
Newer Older
1 2 3 4 5
#
# Secret Labs' Regular Expression Engine
#
# re-compatible interface for the sre matching engine
#
Fredrik Lundh's avatar
Fredrik Lundh committed
6
# Copyright (c) 1998-2001 by Secret Labs AB.  All rights reserved.
7
#
8 9 10 11
# This version of the SRE library can be redistributed under CNRI's
# Python 1.6 license.  For any other use, please contact Secret Labs
# AB (info@pythonware.com).
#
12
# Portions of this engine have been developed in cooperation with
13
# CNRI.  Hewlett-Packard provided funding for 1.6 integration and
14 15 16
# other compatibility work.
#

17
r"""Support for regular expressions (RE).
18 19

This module provides regular expression matching operations similar to
20 21 22 23 24 25 26
those found in Perl.  It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves.  You can
27 28 29 30 31
concatenate ordinary characters, so last matches the string 'last'.

The special characters are:
    "."      Matches any character except a newline.
    "^"      Matches the start of the string.
32 33
    "$"      Matches the end of the string or just before the newline at
             the end of the string.
34 35 36 37 38 39 40
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    "?"      Matches 0 or 1 (greedy) of the preceding RE.
    *?,+?,?? Non-greedy versions of the previous three special characters.
    {m,n}    Matches from m to n repetitions of the preceding RE.
    {m,n}?   Non-greedy version of the above.
41
    "\\"     Either escapes special characters or signals a special sequence.
42 43 44 45 46
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
    "|"      A|B, creates an RE that will match either A or B.
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
47
    (?aiLmsux) Set the A, I, L, M, S, U, or X flag for the RE (see below).
48 49 50 51 52 53
    (?:...)  Non-grouping version of regular parentheses.
    (?P<name>...) The substring matched by the group is accessible by name.
    (?P=name)     Matches the text matched earlier by the group named name.
    (?#...)  A comment; ignored.
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    (?!...)  Matches if ... doesn't match next.
54 55 56 57
    (?<=...) Matches if preceded by ... (must be fixed length).
    (?<!...) Matches if not preceded by ... (must be fixed length).
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                       the (optional) no pattern otherwise.
58 59

The special sequences consist of "\\" and a character from the list
60
below.  If the ordinary character is not on the list, then the
61
resulting RE will match the second character.
62 63 64 65 66
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
67 68 69 70 71
    \d       Matches any decimal digit; equivalent to the set [0-9] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode digits.
    \D       Matches any non-digit character; equivalent to [^\d].
72 73 74 75
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode whitespace characters.
Ezio Melotti's avatar
Ezio Melotti committed
76
    \S       Matches any non-whitespace character; equivalent to [^\s].
77 78 79 80 81
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
             in bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the
             range of Unicode alphanumeric characters (letters plus digits
             plus underscore).
82 83
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
84 85
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.
86 87

This module exports the following functions:
88 89 90 91 92 93 94 95 96 97 98
    match     Match a regular expression pattern to the beginning of a string.
    fullmatch Match a regular expression pattern to all of a string.
    search    Search a string for the presence of a pattern.
    sub       Substitute occurrences of a pattern found in a string.
    subn      Same as sub, but also return the number of substitutions made.
    split     Split a string by the occurrences of a pattern.
    findall   Find all occurrences of a pattern in a string.
    finditer  Return an iterator yielding a match object for each match.
    compile   Compile a pattern into a RegexObject.
    purge     Clear the regular expression cache.
    escape    Backslash all non-alphanumerics in a string.
99 100

Some of the functions in this module takes flags as optional parameters:
101 102 103 104 105 106
    A  ASCII       For string patterns, make \w, \W, \b, \B, \d, \D
                   match the corresponding ASCII character categories
                   (rather than the whole Unicode categories, which is the
                   default).
                   For bytes patterns, this flag is the only available
                   behaviour and needn't be specified.
107 108
    I  IGNORECASE  Perform case-insensitive matching.
    L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
109 110 111 112
    M  MULTILINE   "^" matches the beginning of lines (after a newline)
                   as well as the string.
                   "$" matches the end of lines (before a newline) as well
                   as the end of the string.
113 114
    S  DOTALL      "." matches any character at all, including the newline.
    X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
115 116
    U  UNICODE     For compatibility only. Ignored for string patterns (it
                   is the default), and forbidden for bytes patterns.
117 118 119 120

This module also defines an exception 'error'.

"""
121

122
import sre_compile
Fredrik Lundh's avatar
Fredrik Lundh committed
123
import sre_parse
124 125 126 127
try:
    import _locale
except ImportError:
    _locale = None
128

129
# public symbols
130 131 132 133 134 135 136
__all__ = [
    "match", "fullmatch", "search", "sub", "subn", "split",
    "findall", "finditer", "compile", "purge", "template", "escape",
    "error", "A", "I", "L", "M", "S", "X", "U",
    "ASCII", "IGNORECASE", "LOCALE", "MULTILINE", "DOTALL", "VERBOSE",
    "UNICODE",
]
137

138
__version__ = "2.2.1"
Fredrik Lundh's avatar
Fredrik Lundh committed
139

140
# flags
141
A = ASCII = sre_compile.SRE_FLAG_ASCII # assume ascii "locale"
Fredrik Lundh's avatar
Fredrik Lundh committed
142 143
I = IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE # ignore case
L = LOCALE = sre_compile.SRE_FLAG_LOCALE # assume current 8-bit locale
144
U = UNICODE = sre_compile.SRE_FLAG_UNICODE # assume unicode "locale"
Fredrik Lundh's avatar
Fredrik Lundh committed
145 146 147
M = MULTILINE = sre_compile.SRE_FLAG_MULTILINE # make anchors look for newline
S = DOTALL = sre_compile.SRE_FLAG_DOTALL # make dot match newline
X = VERBOSE = sre_compile.SRE_FLAG_VERBOSE # ignore whitespace and comments
148

Fredrik Lundh's avatar
Fredrik Lundh committed
149 150 151
# sre extensions (experimental, don't rely on these)
T = TEMPLATE = sre_compile.SRE_FLAG_TEMPLATE # disable backtracking
DEBUG = sre_compile.SRE_FLAG_DEBUG # dump pattern after compilation
Fredrik Lundh's avatar
Fredrik Lundh committed
152 153

# sre exception
Fredrik Lundh's avatar
Fredrik Lundh committed
154
error = sre_compile.error
Fredrik Lundh's avatar
Fredrik Lundh committed
155

156 157 158 159
# --------------------------------------------------------------------
# public interface

def match(pattern, string, flags=0):
Fredrik Lundh's avatar
Fredrik Lundh committed
160 161
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
162
    return _compile(pattern, flags).match(string)
163

164 165 166 167 168
def fullmatch(pattern, string, flags=0):
    """Try to apply the pattern to all of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).fullmatch(string)

169
def search(pattern, string, flags=0):
Fredrik Lundh's avatar
Fredrik Lundh committed
170 171
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
172 173
    return _compile(pattern, flags).search(string)

174
def sub(pattern, repl, string, count=0, flags=0):
Fredrik Lundh's avatar
Fredrik Lundh committed
175 176
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
177
    replacement repl.  repl can be either a string or a callable;
178 179
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
180
    a replacement string to be used."""
181
    return _compile(pattern, flags).sub(repl, string, count)
182

183
def subn(pattern, repl, string, count=0, flags=0):
Fredrik Lundh's avatar
Fredrik Lundh committed
184 185 186 187
    """Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl.  number is the number of
188
    substitutions that were made. repl can be either a string or a
189 190
    callable; if a string, backslash escapes in it are processed.
    If it is a callable, it's passed the match object and must
191
    return a replacement string to be used."""
192
    return _compile(pattern, flags).subn(repl, string, count)
193

194
def split(pattern, string, maxsplit=0, flags=0):
Fredrik Lundh's avatar
Fredrik Lundh committed
195
    """Split the source string by the occurrences of the pattern,
196 197 198 199 200 201
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list."""
202
    return _compile(pattern, flags).split(string, maxsplit)
203

204
def findall(pattern, string, flags=0):
Fredrik Lundh's avatar
Fredrik Lundh committed
205 206
    """Return a list of all non-overlapping matches in the string.

207 208
    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
Fredrik Lundh's avatar
Fredrik Lundh committed
209 210 211
    has more than one group.

    Empty matches are included in the result."""
212
    return _compile(pattern, flags).findall(string)
213

214 215 216
def finditer(pattern, string, flags=0):
    """Return an iterator over all non-overlapping matches in the
    string.  For each match, the iterator returns a match object.
217

218 219
    Empty matches are included in the result."""
    return _compile(pattern, flags).finditer(string)
220

221
def compile(pattern, flags=0):
Fredrik Lundh's avatar
Fredrik Lundh committed
222
    "Compile a regular expression pattern, returning a pattern object."
223 224
    return _compile(pattern, flags)

225
def purge():
Antoine Pitrou's avatar
Antoine Pitrou committed
226
    "Clear the regular expression caches"
227 228
    _cache.clear()
    _cache_repl.clear()
229

Fredrik Lundh's avatar
Fredrik Lundh committed
230
def template(pattern, flags=0):
Fredrik Lundh's avatar
Fredrik Lundh committed
231
    "Compile a template pattern, returning a pattern object"
Fredrik Lundh's avatar
Fredrik Lundh committed
232 233
    return _compile(pattern, flags|T)

234
_alphanum_str = frozenset(
235
    "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890")
236
_alphanum_bytes = frozenset(
237
    b"_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890")
Raymond Hettinger's avatar
Raymond Hettinger committed
238

239
def escape(pattern):
240 241 242
    """
    Escape all the characters in pattern except ASCII letters, numbers and '_'.
    """
243 244 245
    if isinstance(pattern, str):
        alphanum = _alphanum_str
        s = list(pattern)
246
        for i, c in enumerate(pattern):
247 248 249 250 251 252 253 254 255 256 257 258 259
            if c not in alphanum:
                if c == "\000":
                    s[i] = "\\000"
                else:
                    s[i] = "\\" + c
        return "".join(s)
    else:
        alphanum = _alphanum_bytes
        s = []
        esc = ord(b"\\")
        for c in pattern:
            if c in alphanum:
                s.append(c)
260
            else:
261 262 263 264 265 266
                if c == 0:
                    s.extend(b"\\000")
                else:
                    s.append(esc)
                    s.append(c)
        return bytes(s)
267 268

# --------------------------------------------------------------------
269 270
# internals

271 272 273
_cache = {}
_cache_repl = {}

Fredrik Lundh's avatar
Fredrik Lundh committed
274 275
_pattern_type = type(sre_compile.compile("", 0))

276
_MAXCACHE = 512
277
def _compile(pattern, flags):
278
    # internal: compile pattern
279
    try:
280 281 282
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
283 284
    except KeyError:
        pass
285
    if isinstance(pattern, _pattern_type):
286
        if flags:
287
            raise ValueError(
288
                "cannot process flags argument with a compiled pattern")
289
        return pattern
290
    if not sre_compile.isstring(pattern):
291
        raise TypeError("first argument must be string or compiled pattern")
292
    p = sre_compile.compile(pattern, flags)
293
    if not (flags & DEBUG):
294 295
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
296
        if p.flags & LOCALE:
297 298
            if not _locale:
                return p
299 300 301 302
            loc = _locale.setlocale(_locale.LC_CTYPE)
        else:
            loc = None
        _cache[type(pattern), pattern, flags] = p, loc
303
    return p
304

305
def _compile_repl(repl, pattern):
Fredrik Lundh's avatar
Fredrik Lundh committed
306
    # internal: compile replacement pattern
307 308 309 310 311 312 313 314 315
    try:
        return _cache_repl[repl, pattern]
    except KeyError:
        pass
    p = sre_parse.parse_template(repl, pattern)
    if len(_cache_repl) >= _MAXCACHE:
        _cache_repl.clear()
    _cache_repl[repl, pattern] = p
    return p
Fredrik Lundh's avatar
Fredrik Lundh committed
316

317 318 319 320 321
def _expand(pattern, match, template):
    # internal: match.expand implementation hook
    template = sre_parse.parse_template(template, pattern)
    return sre_parse.expand_template(template, match)

322 323
def _subx(pattern, template):
    # internal: pattern.sub/subn implementation helper
324 325
    template = _compile_repl(template, pattern)
    if not template[0] and len(template[1]) == 1:
326
        # literal replacement
327 328 329 330
        return template[1][0]
    def filter(match, template=template):
        return sre_parse.expand_template(template, match)
    return filter
331 332 333

# register myself for pickling

334
import copyreg
335 336 337 338

def _pickle(p):
    return _compile, (p.pattern, p.flags)

339
copyreg.pickle(_pattern_type, _pickle, _compile)
340 341 342 343 344

# --------------------------------------------------------------------
# experimental stuff (see python-dev discussions for details)

class Scanner:
345
    def __init__(self, lexicon, flags=0):
346
        from sre_constants import BRANCH, SUBPATTERN
347
        self.lexicon = lexicon
348
        # combine phrases into a compound pattern
349
        p = []
350
        s = sre_parse.Pattern()
351
        s.flags = flags
352
        for phrase, action in lexicon:
353
            gid = s.opengroup()
354
            p.append(sre_parse.SubPattern(s, [
355
                (SUBPATTERN, (gid, sre_parse.parse(phrase, flags))),
356
                ]))
357
            s.closegroup(gid, p[-1])
358 359
        p = sre_parse.SubPattern(s, [(BRANCH, (None, p))])
        self.scanner = sre_compile.compile(p)
360 361 362
    def scan(self, string):
        result = []
        append = result.append
363
        match = self.scanner.scanner(string).match
364
        i = 0
365
        while True:
366
            m = match()
367 368 369 370 371
            if not m:
                break
            j = m.end()
            if i == j:
                break
372
            action = self.lexicon[m.lastindex-1][1]
373
            if callable(action):
Fredrik Lundh's avatar
Fredrik Lundh committed
374
                self.match = m
375 376 377 378 379
                action = action(self, m.group())
            if action is not None:
                append(action)
            i = j
        return result, string[i:]