:mod:`shlex` --- Simple lexical analysis
========================================

.. module:: shlex
   :synopsis: Simple lexical analysis for Unix shell-like languages.
.. moduleauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. moduleauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>
.. sectionauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. sectionauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>

**Source code:** :source:`Lib/shlex.py`

--------------

The :class:`~shlex.shlex` class makes it easy to write lexical analyzers for
simple syntaxes resembling that of the Unix shell.  This will often be useful
for writing minilanguages (for example, in run control files for Python
applications) or for parsing quoted strings.

The :mod:`shlex` module defines the following functions:


.. function:: split(s, comments=False, posix=True)

   Split the string *s* using shell-like syntax. If *comments* is :const:`False`
   (the default), the parsing of comments in the given string will be disabled
   (setting the :attr:`~shlex.commenters` attribute of the
   :class:`~shlex.shlex` instance to the empty string).  This function operates
   in POSIX mode by default, but uses non-POSIX mode if the *posix* argument is
   false.

   .. note::

      Since the :func:`split` function instantiates a :class:`~shlex.shlex`
      instance, passing ``None`` for *s* will read the string to split from
      standard input.
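
As a rough illustration of how the *comments* and *posix* arguments of
:func:`split` interact, the sketch below tokenizes one sample line three ways.
The sample line and the variable names are made up for this example.

```python
import shlex

line = 'cp "my file.txt" /tmp  # copy it'

# Default: POSIX mode with comment parsing disabled, so '#' is treated
# as an ordinary character and the words after it become tokens too.
default_tokens = shlex.split(line)

# comments=True drops everything from '#' to the end of the line.
no_comment_tokens = shlex.split(line, comments=True)

# posix=False keeps the quote characters as part of the token.
legacy_tokens = shlex.split(line, comments=True, posix=False)

print(default_tokens)
print(no_comment_tokens)
print(legacy_tokens)
```
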


.. function:: quote(s)

   Return a shell-escaped version of the string *s*.  The returned value is a
   string that can safely be used as one token in a shell command line, for
   cases where you cannot use a list.

   This idiom would be unsafe::

      >>> filename = 'somefile; rm -rf ~'
      >>> command = 'ls -l {}'.format(filename)
      >>> print(command)  # executed by a shell: boom!
      ls -l somefile; rm -rf ~

   :func:`quote` lets you plug the security hole::

      >>> command = 'ls -l {}'.format(quote(filename))
      >>> print(command)
      ls -l 'somefile; rm -rf ~'
      >>> remote_command = 'ssh home {}'.format(quote(command))
      >>> print(remote_command)
      ssh home 'ls -l '"'"'somefile; rm -rf ~'"'"''

   The quoting is compatible with UNIX shells and with :func:`split`:

      >>> remote_command = split(remote_command)
      >>> remote_command
      ['ssh', 'home', "ls -l 'somefile; rm -rf ~'"]
      >>> command = split(remote_command[-1])
      >>> command
      ['ls', '-l', 'somefile; rm -rf ~']

   .. versionadded:: 3.3

The :mod:`shlex` module defines the following class:


.. class:: shlex(instream=None, infile=None, posix=False)

   A :class:`~shlex.shlex` instance or subclass instance is a lexical analyzer
   object.  The initialization argument, if present, specifies where to read
   characters from. It must be a file-/stream-like object with
   :meth:`~io.TextIOBase.read` and :meth:`~io.TextIOBase.readline` methods, or
   a string.  If no argument is given, input will be taken from ``sys.stdin``.
   The second optional argument is a filename string, which sets the initial
   value of the :attr:`~shlex.infile` attribute.  If the *instream*
   argument is omitted or equal to ``sys.stdin``, this second argument
   defaults to "stdin".  The *posix* argument defines the operational mode:
   when *posix* is not true (default), the :class:`~shlex.shlex` instance will
   operate in compatibility mode.  When operating in POSIX mode,
   :class:`~shlex.shlex` will try to be as close as possible to the POSIX shell
   parsing rules.
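
The following sketch shows one way the constructor arguments might be used:
the same text is lexed in compatibility mode and in POSIX mode, and a
file-like object is passed together with a name for :attr:`~shlex.infile`.
The sample text and the name ``demo.conf`` are placeholders chosen for this
example.

```python
import io
import shlex

text = 'key = "some value"'

# A plain string is accepted directly as the input source.
compat_tokens = list(shlex.shlex(text))              # compatibility mode
posix_tokens = list(shlex.shlex(text, posix=True))   # POSIX mode

# A file-like object, plus a name that is stored in the infile attribute.
stream_lexer = shlex.shlex(io.StringIO(text), infile='demo.conf', posix=True)
stream_name = stream_lexer.infile

print(compat_tokens)   # quotes are kept in compatibility mode
print(posix_tokens)    # quotes are stripped in POSIX mode
print(stream_name)
```
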


.. seealso::

   Module :mod:`configparser`
      Parser for configuration files similar to the Windows :file:`.ini` files.


.. _shlex-objects:

shlex Objects
-------------

A :class:`~shlex.shlex` instance has the following methods:


.. method:: shlex.get_token()

   Return a token.  If tokens have been stacked using :meth:`push_token`, pop a
   token off the stack.  Otherwise, read one from the input stream.  If reading
   encounters an immediate end-of-file, :attr:`eof` is returned (the empty
   string (``''``) in non-POSIX mode, and ``None`` in POSIX mode).


.. method:: shlex.push_token(str)

   Push the argument onto the token stack.
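
A minimal sketch of the token stack, assuming a POSIX-mode lexer over a
two-word sample string: a token is read, pushed back, read again, and the
end-of-input value is compared with :attr:`eof`.

```python
import shlex

lexer = shlex.shlex('alpha beta', posix=True)

first = lexer.get_token()    # read from the input stream
lexer.push_token(first)      # put it back on the token stack
again = lexer.get_token()    # popped from the stack, not re-read
second = lexer.get_token()   # next token from the input stream
end = lexer.get_token()      # input exhausted: the eof value is returned

print(first, again, second, end)
```

In non-POSIX mode the final call would return the empty string instead of
``None``.
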


.. method:: shlex.read_token()

   Read a raw token.  Ignore the pushback stack, and do not interpret source
   requests.  (This is not ordinarily a useful entry point, and is documented here
   only for the sake of completeness.)


.. method:: shlex.sourcehook(filename)

   When :class:`~shlex.shlex` detects a source request (see :attr:`source`
   below) this method is given the following token as argument, and expected
   to return a tuple consisting of a filename and an open file-like object.

   Normally, this method first strips any quotes off the argument.  If the result
   is an absolute pathname, or there was no previous source request in effect, or
   the previous source was a stream (such as ``sys.stdin``), the result is left
   alone.  Otherwise, if the result is a relative pathname, the directory part of
   the name of the file immediately before it on the source inclusion stack is
   prepended (this behavior is like the way the C preprocessor handles ``#include
   "file.h"``).

   The result of the manipulations is treated as a filename, and returned as the
   first component of the tuple, with :func:`open` called on it to yield the second
   component. (Note: this is the reverse of the order of arguments in instance
   initialization!)

   This hook is exposed so that you can use it to implement directory search paths,
   addition of file extensions, and other namespace hacks. There is no
   corresponding 'close' hook, but a shlex instance will call the
   :meth:`~io.IOBase.close` method of the sourced input stream when it returns
   EOF.

   For more explicit control of source stacking, use the :meth:`push_source` and
   :meth:`pop_source` methods.
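
As an illustration of overriding :meth:`sourcehook`, the sketch below resolves
source requests against an in-memory dictionary instead of the filesystem.
The ``MemoryLexer`` class and the ``SNIPPETS`` mapping are hypothetical names
invented for this example; they are not part of the module.

```python
import io
import shlex

# Hypothetical in-memory "files", used so the sketch needs no filesystem.
SNIPPETS = {'extra': 'x y'}


class MemoryLexer(shlex.shlex):
    """Lexer whose source requests are looked up in SNIPPETS."""

    def sourcehook(self, newfile):
        # Strip optional double quotes, then return (name, open stream),
        # the tuple shape that sourcehook() is expected to produce.
        name = newfile.strip('"')
        return name, io.StringIO(SNIPPETS[name])


lexer = MemoryLexer('start source extra end', posix=True)
lexer.source = 'source'   # enable source requests with this keyword
tokens = list(lexer)

print(tokens)
```

When the stacked stream is exhausted the lexer closes it and resumes reading
the original input, which is why ``end`` still appears after the included
tokens.
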


.. method:: shlex.push_source(newstream, newfile=None)

   Push an input source stream onto the input stack.  If the filename argument is
   specified it will later be available for use in error messages.  This is the
   same method used internally by the :meth:`sourcehook` method.


.. method:: shlex.pop_source()

   Pop the last-pushed input source from the input stack. This is the same method
   used internally when the lexer reaches EOF on a stacked input stream.
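
A small sketch of explicit source stacking, assuming two made-up source names
(``outer.conf`` and ``inner.conf``): a second stream is pushed after the first
token has been read, and the lexer returns to the original stream once the
pushed one is exhausted.

```python
import io
import shlex

lexer = shlex.shlex('outer1 outer2', infile='outer.conf', posix=True)
first = lexer.get_token()

# Stack a second input source; its name becomes the current infile.
lexer.push_source(io.StringIO('inner1 inner2'), 'inner.conf')
name_while_stacked = lexer.infile

# Remaining tokens: the stacked stream first, then the original one.
rest = list(lexer)
name_after = lexer.infile

print(first, rest)
print(name_while_stacked, name_after)
```
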


.. method:: shlex.error_leader(infile=None, lineno=None)

   This method generates an error message leader in the format of a Unix C compiler
   error label; the format is ``'"%s", line %d: '``, where the ``%s`` is replaced
   with the name of the current source file and the ``%d`` with the current input
   line number (the optional arguments can be used to override these).

   This convenience is provided to encourage :mod:`shlex` users to generate error
   messages in the standard, parseable format understood by Emacs and other Unix
   tools.
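
For instance, an error message could be assembled along these lines. The file
names and the message text are placeholders for this example.

```python
import shlex

lexer = shlex.shlex('a b', infile='demo.conf', posix=True)

# With no arguments, the current infile and lineno are used.
default_leader = lexer.error_leader()

# Both values can be overridden explicitly.
override_leader = lexer.error_leader('other.conf', 12)

message = default_leader + 'unexpected token'
print(message)
print(override_leader)
```
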

Instances of :class:`~shlex.shlex` subclasses have some public instance
variables which either control lexical analysis or can be used for debugging:


.. attribute:: shlex.commenters

   The string of characters that are recognized as comment beginners. All
   characters from the comment beginner to end of line are ignored. Includes just
   ``'#'`` by default.
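
As a sketch of changing the comment character, the example below lexes the
same sample text with the default :attr:`commenters` and with ``';'``. The
sample text is invented for this example.

```python
import shlex

text = 'a b ; trailing note\nc'

# Default commenters ('#'): ';' is just a single-character token.
default_tokens = list(shlex.shlex(text, posix=True))

# With ';' as the comment beginner, the rest of that line is ignored.
lexer = shlex.shlex(text, posix=True)
lexer.commenters = ';'
custom_tokens = list(lexer)

print(default_tokens)
print(custom_tokens)
```
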


.. attribute:: shlex.wordchars

   The string of characters that will accumulate into multi-character tokens.  By
   default, includes all ASCII alphanumerics and underscore.


.. attribute:: shlex.whitespace

   Characters that will be considered whitespace and skipped.  Whitespace bounds
   tokens.  By default, includes space, tab, linefeed and carriage-return.


.. attribute:: shlex.escape

   Characters that will be considered as escape. This will be only used in POSIX
   mode, and includes just ``'\'`` by default.


.. attribute:: shlex.quotes

   Characters that will be considered string quotes.  The token accumulates until
   the same quote is encountered again (thus, different quote types protect each
   other as in the shell.)  By default, includes ASCII single and double quotes.


.. attribute:: shlex.escapedquotes

   Characters in :attr:`quotes` that will interpret escape characters defined in
   :attr:`escape`.  This is only used in POSIX mode, and includes just ``'"'`` by
   default.


.. attribute:: shlex.whitespace_split

   If ``True``, tokens will only be split on whitespace. This is useful, for
   example, for parsing command lines with :class:`~shlex.shlex`, getting
   tokens in a similar way to shell arguments.
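
To make the effect concrete, the sketch below lexes a sample command line
three ways: with the defaults, with :attr:`whitespace_split` enabled, and with
extra characters added to :attr:`wordchars`. The command line is an arbitrary
example.

```python
import shlex

text = 'ls -l /tmp/my-dir'

# Defaults: '-' and '/' are not word characters, so each one is
# returned as a single-character token.
default_tokens = list(shlex.shlex(text, posix=True))

# whitespace_split=True: only whitespace separates tokens.
split_lexer = shlex.shlex(text, posix=True)
split_lexer.whitespace_split = True
split_tokens = list(split_lexer)

# Alternative: extend wordchars with the characters to keep in words.
word_lexer = shlex.shlex(text, posix=True)
word_lexer.wordchars += '-/'
word_tokens = list(word_lexer)

print(default_tokens)
print(split_tokens)
print(word_tokens)
```
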


.. attribute:: shlex.infile

   The name of the current input file, as initially set at class instantiation time
   or stacked by later source requests.  It may be useful to examine this when
   constructing error messages.


.. attribute:: shlex.instream

   The input stream from which this :class:`~shlex.shlex` instance is reading
   characters.


.. attribute:: shlex.source

   This attribute is ``None`` by default.  If you assign a string to it, that
   string will be recognized as a lexical-level inclusion request similar to the
   ``source`` keyword in various shells.  That is, the immediately following token
   will be opened as a filename and input taken from that stream until EOF, at which
   point the :meth:`~io.IOBase.close` method of that stream will be called and
   the input source will again become the original input stream.  Source
   requests may be stacked any number of levels deep.
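
A sketch of a source request handled by the default machinery, assuming a
temporary file named ``extra.conf`` created only for this example. The keyword
``'source'`` is an arbitrary choice; any string can be assigned.

```python
import os
import shlex
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Write a small file for the lexer to include.
    path = os.path.join(tmp, 'extra.conf')
    with open(path, 'w') as f:
        f.write('included1 included2\n')

    text = 'before source {} after'.format(shlex.quote(path))
    lexer = shlex.shlex(text, posix=True)
    lexer.whitespace_split = True   # keep the path as a single token
    lexer.source = 'source'         # enable inclusion requests

    # The included stream is read to EOF and closed, then lexing
    # continues with the original input.
    tokens = list(lexer)

print(tokens)
```
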


.. attribute:: shlex.debug

   If this attribute is numeric and ``1`` or more, a :class:`~shlex.shlex`
   instance will print verbose progress output on its behavior.  If you need
   to use this, you can read the module source code to learn the details.


.. attribute:: shlex.lineno

   Source line number (count of newlines seen so far plus one).


.. attribute:: shlex.token

   The token buffer.  It may be useful to examine this when catching exceptions.


.. attribute:: shlex.eof

   Token used to determine end of file. This will be set to the empty string
   (``''``), in non-POSIX mode, and to ``None`` in POSIX mode.


.. _shlex-parsing-rules:

Parsing Rules
-------------

When operating in non-POSIX mode, :class:`~shlex.shlex` will try to obey the
following rules.

* Quote characters are not recognized within words (``Do"Not"Separate`` is
  parsed as the single word ``Do"Not"Separate``);

* Escape characters are not recognized;

* Enclosing characters in quotes preserve the literal value of all characters
  within the quotes;

* Closing quotes separate words (``"Do"Separate`` is parsed as ``"Do"`` and
  ``Separate``);

* If :attr:`~shlex.whitespace_split` is ``False``, any character not
  declared to be a word character, whitespace, or a quote will be returned as
  a single-character token. If it is ``True``, :class:`~shlex.shlex` will only
  split words on whitespace;

* EOF is signaled with an empty string (``''``);

* It's not possible to parse empty strings, even if quoted.
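
Several of the non-POSIX rules above can be checked with a short sketch. The
``tokens`` helper is a hypothetical convenience function defined only for this
example.

```python
import shlex


def tokens(text):
    """Lex text in non-POSIX (compatibility) mode and return all tokens."""
    return list(shlex.shlex(text))


inner_quotes = tokens('Do"Not"Separate')   # quotes inside a word are kept
closing_quote = tokens('"Do"Separate')     # a closing quote ends the word
punctuation = tokens('a+b')                # '+' is a single-character token
backslash = tokens(r'a\b')                 # backslash is not an escape here

# EOF is signaled with the empty string.
eof_value = shlex.shlex('').get_token()

print(inner_quotes, closing_quote, punctuation, backslash, repr(eof_value))
```
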

When operating in POSIX mode, :class:`~shlex.shlex` will try to obey the
following parsing rules.

* Quotes are stripped out, and do not separate words (``"Do"Not"Separate"`` is
  parsed as the single word ``DoNotSeparate``);

* Non-quoted escape characters (e.g. ``'\'``) preserve the literal value of the
  next character that follows;

* Enclosing characters in quotes which are not part of
  :attr:`~shlex.escapedquotes` (e.g. ``"'"``) preserve the literal value
  of all characters within the quotes;

* Enclosing characters in quotes which are part of
  :attr:`~shlex.escapedquotes` (e.g. ``'"'``) preserve the literal value
  of all characters within the quotes, with the exception of the characters
  mentioned in :attr:`~shlex.escape`.  The escape characters retain their
  special meaning only when followed by the quote in use, or the escape
  character itself. Otherwise the escape character will be considered a
  normal character.

* EOF is signaled with a :const:`None` value;

* Quoted empty strings (``''``) are allowed.
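
The POSIX-mode rules above can be exercised in the same way. Again, the
``tokens`` helper is a hypothetical function written for this example only.

```python
import shlex


def tokens(text):
    """Lex text in POSIX mode and return all tokens."""
    return list(shlex.shlex(text, posix=True))


joined = tokens('"Do"Not"Separate"')     # quotes stripped, one word
escaped_space = tokens(r'a\ b')          # backslash protects the space
single_quoted = tokens(r"'a\b'")         # backslash is literal in '...'
escaped_quote = tokens(r'"a\"b"')        # \" yields a quote inside "..."
other_escape = tokens(r'"a\nb"')         # \n is not special: backslash kept
empty_string = tokens("a '' b")          # quoted empty string is a token

# EOF is signaled with None.
eof_value = shlex.shlex('', posix=True).get_token()

print(joined, escaped_space, single_quoted)
print(escaped_quote, other_escape, empty_string, eof_value)
```
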