RegEx in Python - Letsprogram

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

A RegEx, or regular expression, is a sequence of characters that defines a search pattern.

RegEx is used to check whether a string contains a specified pattern.

Python has a built-in module called re for working with regular expressions. Import the re module before using it.

>>> import re

Regex Functions

The re module provides a set of functions for working with regular expressions.

Function	Description
findall	Returns a list containing all the matches.
search	Returns a Match object if there is any match in the string.
split	Returns a list where the string has been split at every match.
sub	Replaces one or many matches with a given string.
finditer	Finds all substrings where the RE matches and returns them as an iterator.
match	Determines if the RE matches at the beginning of the string.
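As a rough illustration of the functions in the table, the sketch below runs each of them on one sample sentence. The sentence and the variable names are made up for this example; only the re functions themselves come from the module.

```python
import re

# Sample text chosen only for illustration
text = "The rain in Spain stays mainly in the plain"

# findall: every non-overlapping match, returned as a list of strings
all_ai = re.findall(r"ai", text)

# search: the first match anywhere in the string, or None
first_spain = re.search(r"Spain", text)

# match: succeeds only if the pattern matches at the start of the string
at_start = re.match(r"The", text)
not_at_start = re.match(r"rain", text)

# split: cut the string at every match of the pattern
words = re.split(r"\s", text)

# sub: replace every match with the replacement string
dashed = re.sub(r"\s", "-", text)

# finditer: like findall, but yields Match objects one at a time
positions = [m.start() for m in re.finditer(r"ai", text)]

print(all_ai, first_spain.span(), words, dashed, positions)
```

Note that match() returns None for "rain" because the word is present but not at the beginning of the string, while search() would find it.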

MetaCharacters

MetaCharacters are the characters that have special meaning in the RegEx.

Character	Description	Example
[ ]	A set of characters	"[a-m]"
\	Signals a special sequence (can also be used to escape special characters)	"\d"
.	Any character (except newline character)	"he..o"
^	Starts with	"^hello"
$	Ends with	"world$"
*	Zero or more occurrences	"aix*"
+	One or more occurrences	"aix+"
{ }	Exactly the specified number of occurrences	"al{2}"
|	Either or	"falls|stays"
( )	Capture and Group
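The patterns from the table can be tried out directly. The following is a small sketch that reuses the example patterns above; the input strings and variable names are invented for illustration.

```python
import re

# [ ] : any single character from a to m
in_set = re.findall(r"[a-m]", "hello")

# .   : any character except newline
dots = re.search(r"he..o", "hello world")

# ^ and $ : anchor the pattern to the start or end of the string
starts = re.search(r"^hello", "hello world")
ends = re.search(r"world$", "hello world")

# * allows zero occurrences of x, + requires at least one
star = re.findall(r"aix*", "ai aix aixx")
plus = re.findall(r"aix+", "ai aix aixx")

# {2} : exactly two occurrences of the preceding character
twice = re.findall(r"al{2}", "al all alll")

# |   : either alternative
either = re.findall(r"falls|stays", "The rain falls or stays")

# ( ) : capture parts of the match as groups
groups = re.search(r"(\d+)-(\d+)", "pages 10-20").groups()

print(in_set, star, plus, twice, either, groups)
```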

The first metacharacters we’ll look at are `[` and `]`. They’re used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a `'-'`. For example, `[abc]` will match any of the characters `a`, `b`, or `c`; this is the same as `[a-c]`, which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be `[a-z]`.
Metacharacters (except `\`) are not active inside classes. For example, `[akm$]` will match any of the characters `'a'`, `'k'`, `'m'`, or `'$'`; `'$'` is usually a metacharacter, but inside a character class, it’s stripped of its special nature.
You can match the characters not listed within the class by complementing the set. This is indicated by including a `'^'` as the first character of the class. For example, `[^5]` will match any character except `'5'`. If the caret appears elsewhere in a character class, it does not have special meaning. For example: `[5^]` will match either a `'5'` or a `'^'`.
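The character-class rules described so far can be checked with a few one-liners. This is only a sketch; the input strings are arbitrary examples.

```python
import re

# Listing characters individually and using a range give the same set
listed = re.findall(r"[abc]", "aabbccdd")
ranged = re.findall(r"[a-c]", "aabbccdd")

# '$' loses its special meaning inside a class and matches literally
with_dollar = re.findall(r"[akm$]", "cost: $5 km")

# A leading '^' complements the set: anything except '5'
not_five = re.findall(r"[^5]", "1525")

# A '^' that is not first is just a literal caret
five_or_caret = re.findall(r"[5^]", "5^2")

print(listed, with_dollar, not_five, five_or_caret)
```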
Perhaps the most important metacharacter is the backslash, `\`. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a `[` or `\`, you can precede them with a backslash to remove their special meaning: `\[` or `\\`.
Some of the special sequences beginning with `'\'` represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace.
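Here is a short sketch of backslash escaping in practice. It uses raw strings (the r prefix) so that Python passes the backslashes through to the regex engine unchanged, and it also shows re.escape(), a standard helper that escapes metacharacters for you. The sample strings are assumptions made for the example.

```python
import re

# \[ matches a literal '[' instead of opening a character class
brackets = re.findall(r"\[", "a[0] and b[1]")

# \\ matches a single literal backslash
backslashes = re.findall(r"\\", r"C:\temp\file.txt")

# \d is a predefined set: any decimal digit
digits = re.findall(r"\d", "a1b22")

# re.escape() escapes the metacharacters in arbitrary text,
# so '+' below is treated literally rather than as "one or more"
literal = re.escape("1+1=2")
found = re.search(literal, "we know 1+1=2")

print(brackets, len(backslashes), digits, found.group())
```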
Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions. `re.compile()` also accepts an optional flags argument, used to enable various special features and syntax variations. We’ll go over the available settings later, but for now, a single example will do:
>>> p = re.compile('ab*', re.IGNORECASE)
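To see what the flag changes, the sketch below compiles the same pattern with and without re.IGNORECASE and compares the results. The test strings are made up for illustration.

```python
import re

# Same pattern, compiled with and without the case-insensitive flag
p_ignore = re.compile(r"ab*", re.IGNORECASE)
p_strict = re.compile(r"ab*")

# With IGNORECASE, upper-case input still matches
loose = p_ignore.match("ABBB")

# Without the flag, 'A' does not match 'a'
strict = p_strict.match("ABBB")

# A compiled pattern can be reused for several operations
found = p_ignore.findall("ab Abb a")

print(loose.group(), strict, found)
```

Compiling once and reusing the pattern object is mostly a convenience when the same pattern is applied many times.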

Special Sequence

A special sequence is a \ followed by one of the characters below, each of which has a special meaning.

Char	Description	Example
\A	Returns a match if the specified characters are at the beginning of the string	"\AThe"
\b	Returns a match where the specified characters are at the beginning or at the end of a word (the "r" in the beginning is making sure that the string is being treated as a "raw string")	r"\bair" r"air\b"
\B	Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (the "r" in the beginning is making sure that the string is being treated as a "raw string")	r"\Bair" r"air\B"
\d	Returns a match where the string contains digits (numbers from 0-9)	"\d"
\D	Returns a match where the string DOES NOT contain digits	"\D"
\s	Returns a match where the string contains a white space character	"\s"
\S	Returns a match where the string DOES NOT contain a white space character	"\S"
\w	Returns a match where the string contains any word characters (letters from a to z and A to Z, digits from 0-9, and the underscore _ character)	"\w"
\W	Returns a match where the string DOES NOT contain any word characters	"\W"
\Z	Returns a match if the specified characters are at the end of the string	"spain\Z"
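A few of the special sequences from the table are exercised in the sketch below. All patterns are written as raw strings, which avoids surprises such as "\b" being read by Python as a backspace character. The input strings are arbitrary examples.

```python
import re

words = "air fair airy"

# \b : word boundary; \B : anywhere that is not a word boundary
word_start = re.findall(r"\bair", words)   # 'air' beginning a word
word_end = re.findall(r"air\b", words)     # 'air' ending a word
inside = re.findall(r"\Bair", words)       # 'air' not at a word start

# \d / \D : digits and non-digits
digits = re.findall(r"\d", "Room 42")
non_digits = re.findall(r"\D", "a1b2")

# \s : whitespace; \w / \W : word and non-word characters
spaces = re.findall(r"\s", "a b\tc")
word_chars = re.findall(r"\w", "a_1!")
non_word = re.findall(r"\W", "a_1!")

# \A and \Z : the very start and very end of the string
at_start = re.search(r"\AThe", "The rain in Spain")
at_end = re.search(r"Spain\Z", "The rain in Spain")

print(word_start, word_end, inside, digits, non_digits)
```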

>>> import re
>>> p = re.compile("[a-z]+")
>>> p
re.compile('[a-z]+')
>>> text = "what is what ?"
>>> p.match(text)
<re.Match object; span=(0, 4), match='what'>

Now you can query the match object for information about the matching string. Match object instances also have several methods and attributes; the most important ones are:

Method/Attribute	Purpose
group()	Return the string matched by the RE
start()	Return the starting position of the match
end()	Return the ending position of the match
span()	Return a tuple containing the (start, end) positions of the match

>>> m = p.match(text)
>>> m.group()
'what'
>>> m.start()
0
>>> m.end()
4
>>> m.span()
(0, 4)

group() returns the substring that was matched by the RE. start() and end() return the starting and ending index of the match. span() returns both start and end indexes in a single tuple. Since the match() method only checks if the RE matches at the start of a string, start() will always be zero. However, the search() method of patterns scans through the string, so the match may not start at zero in that case.
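The difference between match() and search() is easiest to see on a string that does not begin with a match. The sketch below assumes a sample string with leading punctuation, chosen only for illustration.

```python
import re

p = re.compile(r"[a-z]+")
text = "::: message"

# match() only looks at position 0, where ':' is not a lowercase letter
m_match = p.match(text)

# search() scans forward until it finds the first match
m_search = p.search(text)

print(m_match)            # no match at the start of the string
print(m_search.group())   # the matched substring
print(m_search.span())    # start is not zero here
```

Because match() and search() return None when nothing matches, it is usual to test the result before calling group() on it.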

>>> p = re.compile(r'\d+')
>>> p.findall("on day 12 i spend 50 rupees")
['12', '50']

findall() has to create the entire list before it can be returned as the result. The finditer() method returns a sequence of match object instances as an iterator.
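As a small sketch of finditer(), the loop below walks over the same sentence used with findall() and collects each matched number together with its position, something findall() alone does not report. The variable names are made up for this example.

```python
import re

p = re.compile(r"\d+")
sentence = "on day 12 i spend 50 rupees"

# finditer() yields one Match object per match, produced lazily
results = []
for m in p.finditer(sentence):
    results.append((m.group(), m.span()))

print(results)
```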

With the help of metacharacters and special sequences, we can do RegEx operations easily.
