RegEx in Python - Letsprogram

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

Finding patterns in a string


A RegEx, or regular expression, is a sequence of characters that defines a search pattern.

A RegEx is used to check whether a string contains a specified pattern.

Python has a built-in RegEx module called re for working with regular expressions. Import the re module before using it:

>>> import re

Regex Functions

The re module has a set of functions that deal with regular expressions.

 Function    Description
 findall     Returns a list containing all the matches.
 search      Returns a Match object if there is a match anywhere in the string.
 split       Returns a list where the string has been split at every match.
 sub         Replaces one or more matches with a replacement string.
 finditer    Finds all substrings where the RE matches and returns them as an iterator.
 match       Determines whether the RE matches at the beginning of the string.
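A minimal sketch of several of these functions in use. The sample sentence and the variable names are illustrative assumptions, not part of the re module:

```python
import re

text = "on day 12 i spend 50 rupees"

# findall: every non-overlapping match, returned as a list of strings
numbers = re.findall(r"\d+", text)

# search: the first match anywhere in the string, or None
first = re.search(r"\d+", text)

# match: succeeds only if the pattern matches at the very start
at_start = re.match(r"\d+", text)   # None here, the text starts with "on"

# split: cut the string wherever the pattern matches
words = re.split(r"\s+", text)

# sub: replace every match with the replacement string
masked = re.sub(r"\d+", "#", text)

print(numbers)        # ['12', '50']
print(first.group())  # 12
print(at_start)       # None
print(masked)         # on day # i spend # rupees
```

Because search() and match() return None when nothing matches, it is usually worth checking the result before calling group() on it.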

MetaCharacters

Metacharacters are characters that have a special meaning in a RegEx.

 Character   Description                                                                  Example
 [ ]         A set of characters                                                          "[a-m]"
 \           Signals a special sequence (can also be used to escape special characters)   "\d"
 .           Any character (except the newline character)                                 "he..o"
 ^           Starts with                                                                  "^hello"
 $           Ends with                                                                    "world$"
 *           Zero or more occurrences                                                     "aix*"
 +           One or more occurrences                                                      "aix+"
 { }         Exactly the specified number of occurrences                                  "al{2}"
 |           Either or                                                                    "falls|stays"
 ( )         Capture and group
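The example patterns from the table can be tried out with a short script. This is only a sketch: the test strings, and the hyphenated number used to show ( ), are made up for illustration.

```python
import re

# Each entry: (pattern, a string that should match, a string that should not)
examples = [
    (r"[a-m]", "cat", "zzz"),                 # any one character from a to m
    (r"he..o", "hello", "heo"),               # . stands for any single character
    (r"^hello", "hello world", "say hello"),  # must be at the start
    (r"world$", "hello world", "world cup"),  # must be at the end
    (r"aix*", "ai", "a"),                     # x may appear zero or more times
    (r"aix+", "aixx", "ai"),                  # x must appear at least once
    (r"al{2}", "all", "al"),                  # exactly two l characters
    (r"falls|stays", "it stays", "it goes"),  # either alternative
]

# For every pattern: (matched the first string?, matched the second string?)
results = [
    (bool(re.search(p, yes)), bool(re.search(p, no)))
    for p, yes, no in examples
]

# Parentheses capture the parts of the match they enclose
groups = re.search(r"(\d+)-(\d+)", "10-20").groups()

print(results)  # every entry is (True, False)
print(groups)   # ('10', '20')
```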

The first metacharacters we’ll look at are [ and ]. They’re used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].

Most metacharacters are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it is stripped of its special nature. The backslash is the main exception: it keeps its escaping role inside a class.

You can match the characters not listed within the class by complementing the set. This is indicated by including a '^' as the first character of the class. For example, [^5] will match any character except '5'. If the caret appears elsewhere in a character class, it does not have special meaning. For example: [5^] will match either a '5' or a '^'.
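The character classes described above can be checked quickly with findall(). The input strings here are arbitrary examples chosen for illustration:

```python
import re

# Characters listed individually
listed = re.findall(r"[abc]", "aabbccdd")

# A range: lowercase letters only, so 'H' and 'T' are skipped
lowercase = re.findall(r"[a-z]", "Hi There")

# '$' is an ordinary character inside a class
with_dollar = re.findall(r"[akm$]", "make $5")

# A leading '^' complements the set: anything except '5'
not_five = re.findall(r"[^5]", "1525")

# A '^' that is not first is just a literal caret
five_or_caret = re.findall(r"[5^]", "5^2")

print(listed)         # ['a', 'a', 'b', 'b', 'c', 'c']
print(lowercase)      # ['i', 'h', 'e', 'r', 'e']
print(with_dollar)    # ['m', 'a', 'k', '$']
print(not_five)       # ['1', '2']
print(five_or_caret)  # ['5', '^']
```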

Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
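A short sketch of escaping in practice. It assumes raw strings (the r"..." prefix) so that Python itself does not interpret the backslashes, and it uses re.escape(), a helper in the re module that escapes literal text for you. The sample strings are invented for the example.

```python
import re

# \[ matches a literal opening bracket
bracket = re.search(r"\[", "list[0]")

# \\ matches a single literal backslash (the string below is C:\temp)
backslash = re.search(r"\\", "C:\\temp")

# Without escaping, '+' means "one or more of the previous character",
# so this pattern does not find the literal text 1+1=2
unescaped = re.search("1+1=2", "so 1+1=2")

# re.escape() adds the backslashes needed to match the text literally
escaped = re.search(re.escape("1+1=2"), "so 1+1=2")

print(bracket.start())    # 4
print(backslash.start())  # 2
print(unescaped)          # None
print(escaped.group())    # 1+1=2
```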

Some of the special sequences beginning with '\' represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace.

Regular expressions are compiled with re.compile() into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions. re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations. A single example will do here:

>>> p = re.compile('ab*', re.IGNORECASE)
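To see what the flag changes, the same pattern can be compiled with and without re.IGNORECASE. The names p_strict and found, and the test strings, are assumptions made for this sketch:

```python
import re

# Case-insensitive: 'a' followed by zero or more 'b', in any case
p = re.compile('ab*', re.IGNORECASE)

# The same pattern without the flag is case-sensitive
p_strict = re.compile('ab*')

upper = p.match("ABBB")
strict_upper = p_strict.match("ABBB")
found = p.findall("Abb aB x a")

print(upper.group())  # ABBB
print(strict_upper)   # None
print(found)          # ['Abb', 'aB', 'a']
```

A compiled pattern object can be reused as many times as needed, which keeps the flag choice in one place.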

Special Sequence

A special sequence is a \ followed by one of the characters below, each of which has a special meaning.

 Char   Description                                                                                         Example
 \A     Matches if the specified characters are at the beginning of the string                              r"\AThe"
 \b     Matches where the specified characters are at the beginning or at the end of a word                 r"\bair", r"air\b"
 \B     Matches where the specified characters are present, but NOT at the beginning (or end) of a word     r"\Bair", r"air\B"
 \d     Matches a digit (0-9)                                                                               r"\d"
 \D     Matches any character that is NOT a digit                                                           r"\D"
 \s     Matches a whitespace character                                                                      r"\s"
 \S     Matches any character that is NOT whitespace                                                        r"\S"
 \w     Matches a word character (the letters a-z and A-Z, the digits 0-9, and the underscore _)            r"\w"
 \W     Matches any character that is NOT a word character                                                  r"\W"
 \Z     Matches if the specified characters are at the end of the string                                    r"spain\Z"

The "r" in front of each example makes sure that the string is treated as a "raw string", so Python passes the backslash through to the RegEx unchanged.
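A small sketch that exercises several of these sequences. The helper starts() and both sample strings are hypothetical, written only for this illustration:

```python
import re

def starts(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

words = "air fair airport"

word_start = starts(r"\bair", words)      # "air" and "airport"
word_end = starts(r"air\b", words)        # "air" and "fair"
not_word_start = starts(r"\Bair", words)  # only inside "fair"
not_word_end = starts(r"air\B", words)    # only inside "airport"

sample = "Room 42, floor 7"

digits = re.findall(r"\d", sample)        # single digits
tokens = re.findall(r"\w+", sample)       # runs of word characters
non_word = re.findall(r"\W", sample)      # spaces and punctuation
non_space = re.findall(r"\S+", sample)    # runs without whitespace
at_beginning = re.search(r"\ARoom", sample) is not None
at_end = re.search(r"7\Z", sample) is not None

print(word_start, word_end, not_word_start, not_word_end)
# [0, 9] [0, 5] [5] [9]
print(digits)     # ['4', '2', '7']
print(tokens)     # ['Room', '42', 'floor', '7']
print(non_space)  # ['Room', '42,', 'floor', '7']
```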

>>> import re
>>> p = re.compile("[a-z]+")
>>> p
re.compile('[a-z]+')
>>> text = "what is what ?"
>>> p.match(text)
<re.Match object; span=(0, 4), match='what'>

Now you can query the match object for information about the matching string. Match object instances also have several methods and attributes; the most important ones are:

 Method/Attribute   Purpose
 group()            Return the string matched by the RE
 start()            Return the starting position of the match
 end()              Return the ending position of the match
 span()             Return a tuple containing the (start, end) positions of the match

>>> m = p.match(text)
>>> m.group()
'what'
>>> m.start()
0
>>> m.end()
4
>>> m.span()
(0, 4)

group() returns the substring that was matched by the RE. start() and end() return the starting and ending index of the match. span() returns both the start and end indexes in a single tuple. Since the match() method only checks whether the RE matches at the start of a string, start() will always be zero. However, the search() method of patterns scans through the string, so the match may not start at zero in that case.
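The difference between match() and search() shows up as soon as the string does not begin with a match. A minimal sketch, assuming a made-up input that starts with punctuation:

```python
import re

p = re.compile("[a-z]+")
line = "::: message"

# match() only looks at position 0, where ':' is not a lowercase letter
from_start = p.match(line)

# search() scans forward until it finds a match
anywhere = p.search(line)

print(from_start)        # None
print(anywhere.group())  # message
print(anywhere.span())   # (4, 11)

# Both methods return None on failure, so test before using the result
if from_start is None:
    print("no match at the start of the string")
```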

>>> p = re.compile(r'\d+')
>>> p.findall("on day 12 i spend 50 rupees")
['12', '50']

findall() has to create the entire list before it can be returned as the result. The finditer() method returns a sequence of match object instances as an iterator.
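A brief sketch of finditer() on the same kind of input. Unlike findall(), each item is a match object, so the position of every match is available as well as its text. The variable names are illustrative:

```python
import re

p = re.compile(r"\d+")
sentence = "on day 12 i spend 50 rupees"

# Matches are produced one at a time as the loop asks for them
found = []
for m in p.finditer(sentence):
    found.append((m.group(), m.span()))

print(found)  # [('12', (7, 9)), ('50', (18, 20))]
```

For very long inputs this avoids building the whole list up front, and the loop can stop early once it has what it needs.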

With the help of metacharacters and special sequences, we can do RegEx operations easily.
