re
module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.
Regex Functions
Function | Description |
findall | Returns a list containing all the matches. |
search | Returns a Match object if there is any match in the string. |
split | Returns a list where the string has been split at every match. |
sub | Replaces one or many matches with a replacement string. |
finditer | Finds all substrings where the RE matches and returns them as an iterator. |
match | Determines whether the RE matches at the beginning of the string. |
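As a quick illustration of the functions in the table, here is a minimal sketch. The sample sentence and the variable names are made up for this example.

```python
import re

text = "The rain in Spain stays mainly in the plain"

# findall: every non-overlapping match, returned as a list of strings
all_ai = re.findall(r"ai", text)

# search: the first match anywhere in the string, or None
first = re.search(r"Spain", text)

# split: break the string at every match (here, at each run of whitespace)
words = re.split(r"\s+", text)

# sub: replace every match with a replacement string
dashed = re.sub(r"\s+", "-", text)

# match: succeeds only if the pattern matches at the very start
at_start = re.match(r"The", text)
not_at_start = re.match(r"rain", text)

print(all_ai)
print(first.span())
print(dashed)
```

Note that `re.match(r"rain", text)` returns `None` even though "rain" occurs in the sentence, because it is not at the beginning.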
MetaCharacters
Character | Description | Example |
[ ] | A set of characters | "[a-m]" |
\ | Signals a special sequence (can also be used to escape special characters) | "\d" |
. | Any character (except newline character) | "he..o" |
^ | Starts with | "^hello" |
$ | Ends with | "world$" |
* | Zero or more occurrences | "aix*" |
+ | One or more occurrences | "aix+" |
{ } | Exactly the specified number of occurrences | "al{2}" |
| | Either or | "falls|stays" |
( ) | Capture and group | |
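The table's examples can be tried directly. The following is a small sketch using the same patterns; the test strings are invented for illustration.

```python
import re

# '.' stands for any one character except a newline
dot = re.findall(r"he..o", "hello helio hero")

# '*' allows zero or more of the preceding item, '+' requires at least one
star = re.findall(r"aix*", "ai aix aixx")
plus = re.findall(r"aix+", "ai aix aixx")

# '{2}' requires exactly two repetitions of the preceding item
two_l = re.findall(r"al{2}", "al all alll")

# '|' chooses between alternatives
either = re.findall(r"falls|stays", "The rain falls and stays")

# '^' and '$' anchor the pattern to the start and end of the string
anchored = bool(re.search(r"^hello.*world$", "hello wide world"))

# '( )' groups part of the pattern and captures what it matched
captured = re.search(r"(\d+)-(\d+)", "pages 10-25").groups()

print(dot, star, plus, two_l, either, anchored, captured)
```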
The first metacharacters we’ll look at are [ and ]. They’re used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].
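A short sketch of character classes in practice; the input strings are arbitrary examples.

```python
import re

# [abc] and [a-c] describe the same set of three characters
listed = re.findall(r"[abc]", "a1b2c3d4")
ranged = re.findall(r"[a-c]", "a1b2c3d4")

# [a-z] matches any single lowercase ASCII letter
lowercase = re.findall(r"[a-z]", "Hello World")

print(listed, ranged, lowercase)
```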
Metacharacters (except \) are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class, it’s stripped of its special nature.
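To see the difference, the sketch below uses '$' inside a class and then as an anchor; the sample string is made up.

```python
import re

# Inside the brackets, '$' is an ordinary dollar sign
in_class = re.findall(r"[akm$]", "make $5")

# Outside a class, '$' anchors the pattern to the end of the string
ends_with_5 = bool(re.search(r"5$", "make $5"))

print(in_class, ends_with_5)
```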
You can match the characters not listed within the class by complementing the set. This is indicated by including a '^' as the first character of the class. For example, [^5] will match any character except '5'. If the caret appears elsewhere in a character class, it does not have special meaning. For example: [5^] will match either a '5' or a '^'.
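The two caret positions can be compared side by side. This is a minimal sketch with an invented input string.

```python
import re

sample = "1555^2"

# A leading '^' complements the class: every character except '5'
not_five = re.findall(r"[^5]", sample)

# A '^' that is not first is an ordinary member of the class
five_or_caret = re.findall(r"[5^]", sample)

print(not_five, five_or_caret)
```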
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
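Escaping can be sketched as follows. The patterns are written as raw strings so that Python does not interpret the backslashes before the regex engine sees them. The sample inputs are assumptions for illustration, and `re.escape()` is shown as one standard-library way to build an escaped pattern from literal text.

```python
import re

# \[ matches a literal opening bracket
bracket = re.search(r"\[", "list[0]")

# \\ matches a single literal backslash
backslash = re.search(r"\\", r"C:\temp")

# re.escape() adds the backslashes needed to match text literally
escaped = re.escape("[")

print(bracket.start(), backslash.start(), escaped)
```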
Some of the special sequences beginning with '\' represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace.
Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions. re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations. We’ll go over the available settings later, but for now, a single example will do:
>>> import re
>>> p = re.compile('ab*', re.IGNORECASE)
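Once compiled, the pattern object can be reused for several operations. The sketch below builds the same pattern with the same flag; the input strings are made up.

```python
import re

p = re.compile('ab*', re.IGNORECASE)

# With IGNORECASE, 'ab*' also matches uppercase letters
m = p.match('ABBBc')
matched = m.group()

# The same compiled object can be reused for other operations
found = p.findall('ab Abb a')

# Without the flag, the uppercase input does not match
strict = re.compile('ab*').match('ABBBc')

print(matched, found, strict)
```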
Special Sequences
Char | Description | Example |
\A | Returns a match if the specified characters are at the beginning of the string | "\AThe" |
\b | Returns a match where the specified characters are at the beginning or at the end of a word (the "r" prefix makes sure that the string is treated as a "raw string") | r"\bair" r"air\b" |
\B | Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (the "r" prefix makes sure that the string is treated as a "raw string") | r"\Bair" r"air\B" |
\d | Returns a match where the string contains a digit (0-9) | "\d" |
\D | Returns a match where the string contains a character that is not a digit | "\D" |
\s | Returns a match where the string contains a whitespace character | "\s" |
\S | Returns a match where the string contains a character that is not whitespace | "\S" |
\w | Returns a match where the string contains a word character (letters a-z and A-Z, digits 0-9, and the underscore _ character) | "\w" |
\W | Returns a match where the string contains a character that is not a word character | "\W" |
\Z | Returns a match if the specified characters are at the end of the string | "spain\Z" |
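A few of these sequences are sketched below on one invented sentence. All patterns are written as raw strings, which is the safer habit for any pattern containing a backslash.

```python
import re

text = "Order 66 shipped to air_base 7, repair pending"

digits = re.findall(r"\d+", text)            # runs of digits
words = re.findall(r"\w+", text)             # runs of word characters
starts_air = re.findall(r"\bair\w*", text)   # 'air' at the start of a word
inner_air = re.findall(r"\Bair", text)       # 'air' not at the start of a word
at_start = bool(re.search(r"\AOrder", text))
at_end = bool(re.search(r"pending\Z", text))

print(digits, starts_air, inner_air, at_start, at_end)
```

Here `\bair` finds "air_base" but not the "air" in "repair", while `\Bair` does the opposite.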
Method/Attribute | Purpose |
group | Return the string matched by the RE |
start | Return the starting position of the match |
end | Return the ending position of the match |
span | Return a tuple containing the (start, end) positions of the match |
group() returns the substring that was matched by the RE. start() and end() return the starting and ending index of the match. span() returns both start and end indexes in a single tuple. Since the match() method only checks if the RE matches at the start of a string, start() will always be zero. However, the search() method of patterns scans through the string, so the match may not start at zero in that case.
findall() has to create the entire list before it can be returned as the result. The finditer() method returns a sequence of match object instances as an iterator.
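The match object methods and the match/search difference can be sketched as follows. The pattern and input strings are chosen only for illustration.

```python
import re

p = re.compile(r"[a-z]+")

# match() only looks at the beginning, so a successful match starts at 0
m = p.match("tempo 123")

# search() scans forward, so the match can start later in the string
s = p.search("::: message")

print(m.group(), m.start(), m.end(), m.span())
print(s.group(), s.span())

# finditer() yields match objects one at a time instead of building a list
spans = [found.span() for found in p.finditer("one two three")]
print(spans)
```

Both match() and search() return None when nothing matches, so real code usually checks the result before calling group().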