# cpython-withatomic / Doc / libre.tex

If you find a bug or documentation error, or just find something
unclear, please send a message to \code{string-sig@python.org}, and
we'll fix it.}

This module provides regular expression matching operations similar to
those found in Perl.  It's 8-bit clean: both patterns and strings may
contain null bytes and characters whose high bit is set.  It is always
available.

Regular expressions use the backslash character (\code{\e}) to
indicate special forms or to allow special characters to be used
without invoking their special meaning.  This collides with Python's
usage of the same character for the same purpose in string literals;
for example, to match a literal backslash, one might have to write
\code{\e\e\e\e} as the pattern string, because the regular expression
must be \code{\e\e}, and each backslash must be expressed as
\code{\e\e} inside a regular Python string literal.

The solution is to use Python's raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'.  So \code{r"\e n"} is a two
character string containing a backslash and the letter 'n', while
\code{"\e n"} is a one-character string containing a newline.  Usually
patterns will be expressed in Python code using this raw string
notation.

% XXX Can the following section be dropped, or should it be boiled down?
%\strong{Please note:} There is a little-known fact about Python string
%literals which means that you don't usually have to worry about
%doubling backslashes, even though they are used to escape special
%characters in string literals as well as in regular expressions.  This
%is because Python doesn't remove backslashes from string literals if
%they are followed by an unrecognized escape character.
%\emph{However}, if you want to include a literal \dfn{backslash} in a
%regular expression represented as a string literal, you have to
%\emph{quadruple} it or enclose it in a singleton character class.
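The backslash doubling and the raw string notation described above can
be checked directly.  The following is only a minimal sketch, assuming a
current Python interpreter and the standard \code{re} module; the
variable names are illustrative and do not come from the module itself.

```python
import re

# Without raw strings, a literal backslash is escaped twice over:
# once for the string literal and once for the regular expression.
plain_pattern = "\\\\"   # four characters typed, two in the string
raw_pattern = r"\\"      # the same two-character string, written raw

assert plain_pattern == raw_pattern
assert len(raw_pattern) == 2

# Either spelling matches a single literal backslash in the text.
text = "a\\b"            # the three characters: a, backslash, b
match = re.search(raw_pattern, text)
assert match is not None and match.group() == "\\"

# r"\n" is a backslash followed by 'n'; "\n" is one newline character.
assert len(r"\n") == 2
assert len("\n") == 1
```

Writing the pattern as a raw string keeps the text typed in the source
identical to the regular expression that the module receives.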
%E.g.\ to extract \LaTeX\ \code{\e section\{{\rm
%\ldots}\}} headers from a document, you can use this pattern:
%\code{'[\e ] section\{\e (.*\e )\}'}.  \emph{Another exception:}
%the escape sequence \code{\e b} is significant in string literals
%(where it means the ASCII bell character) as well as in Emacs regular
%expressions (where it stands for a word boundary), so in order to
%search for a word boundary, you should use the pattern \code{'\e \e b'}.
%Similarly, a backslash followed by a digit 0-7 should be doubled to
%avoid interpretation as an octal escape.

\subsection{Regular Expression Syntax}

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression (or if a given regular expression
matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular
expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression.  If a string \emph{p}
matches \emph{A} and another string \emph{q} matches \emph{B}, the
string \emph{pq} will match \emph{AB}.  Thus, complex expressions can
easily be constructed from simpler primitive expressions like the ones
described here.  For details of the theory and implementation of
regular expressions, consult the Friedl book referenced below, or
almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows.
%For further information and a gentler presentation, consult XXX somewhere.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like '\code{A}', '\code{a}', or '\code{0}',
are the simplest regular expressions; they simply match themselves.
You can concatenate ordinary characters, so '\code{last}' matches the
characters 'last'.
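The concatenation rule can be illustrated with a short sketch.  It
assumes a modern Python 3 interpreter, where \code{re.fullmatch} is
available (that function is not part of the module as described in this
section), and the names \code{A}, \code{B}, \code{p} and \code{q} simply
mirror the letters used in the text.

```python
import re

# Two simple expressions, A and B, made only of ordinary characters.
A = r"ab"
B = r"cd"

# The string p matches A, and the string q matches B ...
p, q = "ab", "cd"
assert re.fullmatch(A, p) is not None
assert re.fullmatch(B, q) is not None

# ... so the string pq matches the concatenated expression AB.
AB = A + B
assert re.fullmatch(AB, p + q) is not None

# Ordinary characters match themselves: the RE last matches 'last'.
m = re.match(r"last", "last")
assert m is not None and m.group() == "last"
```

Joining the pattern strings is equivalent to concatenating the
expressions here only because neither part contains \code{|}; that
caveat is an observation about this sketch, not a statement from the
text above.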
(In the rest of this section, we'll write RE's in \code{this special
font}, usually without quotes, and strings to be matched 'in single
quotes'.)

Some characters, like \code{|} or \code{(}, are special.  Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

The special characters are:
\begin{itemize}
\item[\code{.}] (Dot.)  In the default mode, this matches any
character except a newline.  If the \code{DOTALL} flag has been
specified, this matches any character including a newline.
\item[\code{\^}] (Caret.)  Matches the start of the string, and in
\code{MULTILINE} mode also immediately after each newline.
\item[\code{\$}] Matches the end of the string, and in
\code{MULTILINE} mode also matches before a newline.
\code{foo} matches both 'foo' and 'foobar', while the regular
expression \code{foo\$} matches only 'foo'.
%
\item[\code{*}] Causes the resulting RE to
match 0 or more repetitions of the preceding RE, as many repetitions
as are possible.  \code{ab*} will
match 'a', 'ab', or 'a' followed by any number of 'b's.
%
\item[\code{+}] Causes the
resulting RE to match 1 or more repetitions of the preceding RE.
\code{ab+} will match 'a' followed by any non-zero number of 'b's; it
will not match just 'a'.
%
\item[\code{?}] Causes the resulting RE to
match 0 or 1 repetitions of the preceding RE.  \code{ab?} will
match either 'a' or 'ab'.
\item[\code{*?}, \code{+?}, \code{??}] The \code{*}, \code{+}, and
\code{?} qualifiers are all \dfn{greedy}; they match as much text as
possible.  Sometimes this behaviour isn't desired; if the RE
\code{<.*>} is matched against \code{
'<H1>title</H1>'}, it will match the entire string, and not just \code{'<H1>'
}. Adding \code{?} after the qualifier makes it perform the match in \dfn{non-greedy} or \dfn{minimal} fashion; as few characters as possible will be matched. Using \code{.*?} in the previous expression will match only \code{
